Abstract

This work proposes a way to align statistical modeling with decision making. We provide a 
method that propagates the uncertainty in predictive modeling to the uncertainty in operational 
cost, where operational cost is the amount spent by the practitioner in solving the problem. The 
method allows us to explore the range of operational costs associated with the set of reasonable 
statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do 
this, the operational cost is cast as a regularization term in a learning algorithm's objective function, 
allowing either an optimistic or pessimistic view of possible costs, depending on the regularization
parameter. From another perspective, if we have prior knowledge about the operational cost, for 
instance that it should be low, this knowledge can help to restrict the hypothesis space, and can 
help with generalization. We provide a theoretical generalization bound for this scenario. We also 
show that learning with operational costs is related to robust optimization. 
Keywords: statistical learning theory, optimization, covering numbers, decision theory. 



1. Introduction 

Machine learning algorithms are used to produce predictions, and these predictions are often used 
to make a policy or plan of action afterwards, where there is a cost to implement the policy. In this 
work, we would like to understand how the uncertainty in predictive modeling can translate into 
the uncertainty in the cost for implementing the policy. This would help us answer questions like: 
"What is a reasonable amount to allocate for this task so we can react best to whatever nature 
brings?" , "Can we produce a reasonable probabilistic model, supported by data, where we might 
expect to pay a specific amount?" , and "Can our intuition about how much it will cost to solve a 
problem help us produce a better probabilistic model?" The three questions above cannot be answered
by decision theory, where the goal is to produce a single policy that minimizes expected cost. These 
questions also cannot be answered by robust optimization, where the goal is to produce a single 
policy that is robust to the uncertainty in nature. Those paradigms produce a single policy decision 
that takes uncertainty into account, and the chosen policy might not be a best response policy to 
any realistic situation. In contrast, our goal is to understand the uncertainty and how to react to 
it, using policies that would be best responses to individual situations. 

There are many applications in which this method can be used. For example, in scheduling staff 
for a medical clinic, predictions based on a statistical model of the number of patients might be
used to understand the possible policies and costs for staffing. In traffic flow problems, predictions
based on a model of the forecasted traffic might be useful for determining load balancing policies on 
the network and their associated costs. In online advertising, predictions based on models for the 
payoff and ad-click rate might be used to understand policies for when the ad should be displayed 
and the associated revenue. 

In order to propagate the uncertainty in modeling to the uncertainty in costs, we introduce what 
we call the simultaneous process, where we explore the range of predictive models and corresponding 
policy decisions at the same time. The simultaneous process was named to contrast with a more 
traditional sequential process, where first, data are input into a statistical algorithm to produce a 
predictive model, which makes recommendations for the future, and second, the user develops a 
plan of action and projected cost for implementing the policy. The sequential process is commonly 
used in practice, even though there may actually be a whole class of models that could be relevant 
for the policy decision problem. The sequential process essentially assumes that the probabilistic 
model is "correct enough" to make a decision that is "close enough." 

In the simultaneous process, the machine learning algorithm contains a regularization term 
encoding the policy and its associated cost, with an adjustable regularization parameter. If there is 
some uncertainty about how much it will cost to solve the problem, the regularization parameter can 
be swept through an interval to find a range of possible costs, from optimistic to pessimistic. The 
method then produces the most likely scenario for each value of the cost. This way, by looking at the 
full range of the regularization parameter, we sweep out costs for all of the reasonable probabilistic 
models. This range can be used to determine how much might be reasonably allocated to solve the 
problem. 

Having the full range of costs for reasonable models can directly answer the question in the 
first paragraph regarding allocation, "What is a reasonable amount to allocate for this task so 
we can react best to whatever nature brings?" One might choose to allocate the maximum cost 
for the set of reasonable predictive models for instance. The second question above is "Can we 
produce a reasonable probabilistic model, supported by data, where we might expect to pay a 
specific amount?" This is an important question, since business managers often like to know if 
there is some scenario/decision pair that is supported by the data, but for which the operational 
cost is low (or high); the simultaneous process would be able to find such scenarios directly. To do 
this, we would look at the setting of the regularization parameter that resulted in the desired value 
of the cost, and then look at the solution of the simultaneous formulation, which gives the model 
and its corresponding policy decision. 

Let us consider the third question above, which is "Can our intuition about how much it will 
cost to solve a problem help us produce a better probabilistic model?" The regularization parameter 
can be interpreted to regulate the strength of our belief in the operational cost. If we have a strong 
belief in the cost to solve the problem, and if that belief is correct, this will guide the choice of 
regularization parameter, and will help with prediction. In many real scenarios, a practitioner or 
domain expert might truly have a prior belief on the cost to complete a task. Arguably, a manager 
having this more grounded type of prior belief is much more natural than, for instance, the manager 
having a prior belief on the $\ell_2$ norm of the coefficients of a linear model, or the number of nonzero
coefficients in the model. Being able to encode this type of prior belief on cost could potentially 
be helpful for prediction: as with other types of prior beliefs, it can help to restrict the hypothesis 
space and can assist with generalization. In this work, we show that the restricted hypothesis spaces 
resulting from our method can often be bounded by an intersection of an $\ell_q$ ball with a halfspace,
and this is true for many different types of decision problems. We analyze the complexity of this
type of hypothesis space with a technique based on Maurey's Lemma (Barron, 1993; Zhang, 2002) 
that leads eventually to a counting problem, where we calculate the number of integer points within 
a polyhedron in order to obtain a covering number bound. 

The operational cost regularization term can be the optimal value of a complicated optimization 
problem, like a scheduling problem. This means we will need to solve an optimization problem each 
time we evaluate the learning algorithm's objective. However, the practitioner must be able to solve 
that problem anyway in order to develop a plan of action; it is the same problem they need to solve 
in the traditional sequential process, or using decision theory. Since the decision problem is solved 
only on data from the present, whose labels are not yet known, solving the decision problem may 
not be difficult, especially if the number of unlabeled examples is small. In that case, the method 
can still scale up to huge historical data sets, since the historical data factors into the training 
error term but not the new regularization term, and both terms can be computed. An example 
is to compute a schedule for a day, based on factors of the various meetings on the schedule that 
day. We can use a very large amount of past meeting-length data for the training error term, but 
then we use only the small set of possible meetings coming up that day to pass into the scheduling 
problem. In that case, both the training error term and the regularization term are able to be 
computed, and the objective can be minimized. 

As mentioned above, the simultaneous process is different from decision theory, for instance, 
Bayesian decision theory. A core idea in decision theory is to choose a single policy that maximizes 
expected utility, or minimizes expected cost. Our goal is not to find a single policy that is useful 
on average. In contrast, our goal is to trace out a path of models, their specific (not average) 
optimal-response policies, and their costs. The policy from decision theory may not correspond 
to the best decision for any particular single model, whereas that is something we want in our 
case. We trace out this path by changing our prior belief on the operational cost (that is, by 
changing the strength of our regularization term). In Bayesian decision theory, the prior is over 
possible probabilistic models, rather than on possible costs as in this paper. (Note that one could 
potentially use Bayesian decision theory to produce a posterior, by incorporating a prior belief on 
the operational cost as in this paper, but that is not our goal in this work.) The simultaneous process 
also differs in principle from robust optimization. In robust optimization, one would generally need 
to allocate much more than is necessary for any single realistic situation, in order to produce a 
policy that is robust to almost all situations. However, this is not always true; in fact, we show in 
this work that in some circumstances, while sweeping through the regularization parameter, one 
of the results produced by the simultaneous process is the same as the one coming from robust 
optimization. 

We introduce the sequential and simultaneous processes in Section 2. In Section 3, we give several
examples of algorithms that incorporate these operational costs. Our first example application
is a staffing problem at a medical clinic, where the decision problem is to staff a set of stations that 
patients must complete in a certain order. The time required for patients to complete each station 
is random and estimated from past data. The second example is a real-estate purchasing problem, 
where the policy decision is to purchase a subset of available properties. The values of the properties 
need to be estimated from comparable sales. The third example is a call center staffing problem, 
where we need to create a staffing policy based on historical call arrival and service time information.
A fourth example is the "Machine Learning and Traveling Repairman Problem" (ML&TRP)
where the policy decision is a route for a repair crew. As mentioned above, there is a large subset
of problems that can be formulated using the simultaneous process that have a special property:
they are equivalent to robust optimization (RO) problems. Section 4 discusses this relationship and
provides, under specific conditions, the equivalence of the simultaneous process with RO. Robust 
optimization, when used for decision-making, does not usually include machine learning, nor any 
other type of statistical model, so we discuss how a statistical model can be incorporated within 
an uncertainty set for an RO. Specifically, we discuss how different loss functions from machine 
learning correspond to different uncertainty sets. We also discuss the overlap between RO and the 
optimistic and pessimistic versions of the simultaneous process. 

We consider the implications of the simultaneous process on statistical learning theory in Section 5.
In particular, we aim to understand how operational costs affect prediction (generalization)
ability. We show first that the hypothesis spaces for most of the applications in Section 3 can be
bounded in a specific way - by an intersection of a ball and a halfspace - and this is true regardless of 
how complicated the constraints of the optimization problem are, and how different the operational 
costs are from each other in the different applications. Second, we bound the complexity of this 
type of hypothesis space using a technique based on Maurey's Lemma (Barron, 1993; Zhang, 2002)
that leads eventually to a counting problem, where we calculate the number of integer points within 
a polyhedron in order to obtain a generalization bound. Our results show that it is possible to 
make use of much more general structure in estimation problems, compared to the standard (norm- 
constrained) structures like sparsity and smoothness; further, this additional structure can benefit 
generalization ability. A shorter version of this work appears in (Tulabandhula and Rudin, 2012).



2. The Sequential and Simultaneous Processes 

We have a training set of (random) labeled instances, $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, that we
will use to learn a function $f^* : \mathcal{X} \to \mathcal{Y}$. Commonly in machine learning this is done by choosing
$f$ to be the solution of a minimization problem:

$$f^* \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left( \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) \right), \qquad (1)$$

for some loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, regularizer $R : \mathcal{F}^{unc} \to \mathbb{R}$, constant $C_2$ and function class
$\mathcal{F}^{unc}$. Here, $\mathcal{Y} \subset \mathbb{R}$. Typical loss functions used in machine learning are the 0-1 loss, ramp loss,
hinge loss, logistic loss and the exponential loss. Function class $\mathcal{F}^{unc}$ is commonly the class of
all linear functionals, where an element $f \in \mathcal{F}^{unc}$ is of the form $\beta^T x$, where $\mathcal{X} \subset \mathbb{R}^p$, $\beta \in \mathbb{R}^p$.

We have used 'unc' in the superscript for $\mathcal{F}^{unc}$ to refer to the word "unconstrained," since it
contains all linear functionals. Typical regularizers $R$ are the $\ell_1$ and $\ell_2$ norms of $\beta$. Note that
nonlinearities can be incorporated into $\mathcal{F}^{unc}$ by allowing nonlinear features, so that we now would
have $f(x) = \sum_{j=1}^p \beta_j h_j(x)$, where $\{h_j\}_j$ is the set of features, which can be arbitrary nonlinear
functions of $x$; for simplicity in notation, we will equate $h_j(x) = x_j$ and have $\mathcal{X} \subset \mathbb{R}^p$.
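As a concrete illustration of the minimization problem (1), the following minimal Python sketch evaluates and minimizes the objective for a linear model. The squared loss, the squared $\ell_2$ regularizer, the gradient-descent solver, and the names `objective` and `fit` are illustrative assumptions, not choices prescribed by the text.

```python
def objective(beta, X, y, C2):
    """Evaluate sum_i l(f(x_i), y_i) + C2 * R(f) for f(x) = beta^T x.

    Assumptions for illustration: squared loss and R(f) = ||beta||_2^2.
    """
    loss = sum((yi - sum(b * xj for b, xj in zip(beta, xi))) ** 2
               for xi, yi in zip(X, y))
    return loss + C2 * sum(b * b for b in beta)


def fit(X, y, C2, steps=5000, lr=0.01):
    """Minimize the objective by plain gradient descent (a simple stand-in solver)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        # Gradient of the regularizer: 2 * C2 * beta.
        grad = [2.0 * C2 * b for b in beta]
        # Gradient of the squared loss: 2 * sum_i (beta^T x_i - y_i) * x_i.
        for xi, yi in zip(X, y):
            r = sum(b * xj for b, xj in zip(beta, xi)) - yi
            for j in range(p):
                grad[j] += 2.0 * r * xi[j]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta


# Tiny synthetic training set (hypothetical data).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta_star = fit(X, y, C2=0.0)
```

With `C2=0.0` the data are fit exactly; increasing `C2` shrinks the coefficients toward zero, which is the role the regularizer plays in (1).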

Consider an organization making policy decisions. Given a new collection of unlabeled instances
$\{\tilde{x}_i\}_{i=1}^m$, the organization wants to create a policy $\pi^*$ that minimizes a certain operational cost
$\text{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$. Of course, if the organization knew the true labels for the $\{\tilde{x}_i\}_i$'s beforehand,
it would choose a policy to optimize the operational cost based directly on these labels, and would
not need $f^*$. Since the labels are not known, the operational costs are calculated using the model's
predictions, the $f^*(\tilde{x}_i)$'s. The difference between the traditional sequential process and the new
simultaneous process is whether $f^*$ is chosen with or without knowledge of the operational cost.

As an example, consider $\{\tilde{x}_i\}_i$ as representing machines in a factory waiting to be repaired,
where the first feature $\tilde{x}_{i,1}$ is the age of the machine, the second feature $\tilde{x}_{i,2}$ is the condition at
its last inspection, etc. The value $f^*(\tilde{x}_i)$ is the predicted probability of failure for $\tilde{x}_i$. Policy $\pi^*$ is
the order in which the machines $\{\tilde{x}_i\}_i$ are repaired, which is chosen based on how likely they are
to fail, that is, $\{f^*(\tilde{x}_i)\}_i$, and on the costs of the various types of repairs needed. The traditional
sequential process picks a model $f^*$, based on past failure data without the knowledge of operational
cost, and afterwards computes $\pi^*$ based on an optimization problem involving the $\{f^*(\tilde{x}_i)\}_i$'s and
the operational cost. The new simultaneous process picks $f^*$ and $\pi^*$ at the same time, based on
optimism or pessimism on the operational cost of $\pi^*$.

Formally, the sequential process computes the policy according to two steps, as follows.

Step 1: Create function $f^*$ based on $\{(x_i, y_i)\}_i$ according to (1). That is,

$$f^* \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left( \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) \right).$$

Step 2: Choose policy $\pi^*$ to minimize the operational cost,

$$\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \text{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i).$$

The operational cost $\text{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$ is the amount the organization will spend if policy $\pi$ is
chosen in response to the values of $\{f^*(\tilde{x}_i)\}_i$.
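The two steps of the sequential process can be sketched on a toy problem as follows. Everything specific here is an assumption made for illustration: a one-feature least squares model, a hypothetical staffing cost `op_cost` (a wage per staffed unit plus a penalty per unit of predicted demand left unserved), and a finite policy set.

```python
def fit_ridge_1d(xs, ys, C2):
    """Step 1: closed-form minimizer of sum_i (y_i - beta*x_i)^2 + C2*beta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + C2)


def op_cost(pi, beta, x_new, wage=1.0, shortfall_penalty=3.0):
    """Hypothetical OpCost: pay `wage` per staffed unit `pi`, plus a penalty
    for each unit of predicted demand that the staffing level does not cover."""
    predicted = sum(beta * x for x in x_new)
    return wage * pi + shortfall_penalty * max(0.0, predicted - pi)


def sequential(xs, ys, x_new, policies, C2):
    beta = fit_ridge_1d(xs, ys, C2)                             # Step 1: model only
    pi = min(policies, key=lambda p: op_cost(p, beta, x_new))   # Step 2: policy only
    return beta, pi, op_cost(pi, beta, x_new)


# Hypothetical labeled history and unlabeled new instances.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
x_new = [1.0, 1.5]
beta, pi, cost = sequential(xs, ys, x_new, policies=range(0, 11), C2=0.0)
```

Note that the model in Step 1 is chosen without any reference to `op_cost`; the cost enters only when the policy is selected in Step 2.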

To define the simultaneous process, we combine Steps 1 and 2 of the sequential process. We 
can choose an optimistic bias, where we prefer (all else being equal) a model providing lower 
costs, or we can choose a pessimistic bias that prefers higher costs, where the degree of optimism 
or pessimism is controlled by a parameter $C_1$. In other words, the optimistic bias lowers costs when
there is uncertainty, whereas the pessimistic bias raises them. The new steps are as follows. 

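Step 1: Choose a model $f^\circ$ obeying one of the following:

Optimistic Bias:
$$f^\circ \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) + C_1 \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right], \qquad (2)$$

Pessimistic Bias:
$$f^\circ \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left[ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right]. \qquad (3)$$

Step 2: Compute the policy:

$$\pi^\circ \in \operatorname{argmin}_{\pi \in \Pi} \text{OpCost}(\pi, f^\circ, \{\tilde{x}_i\}_i).$$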

When $C_1 = 0$, the simultaneous process becomes the sequential process; the sequential process
is a special case of the simultaneous process.
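The optimistic and pessimistic objectives can be sketched on a toy one-feature problem. The grid search over $\beta$, the hypothetical staffing cost, and the discrete policy set below are all illustrative assumptions; a real instance would replace the inner `min` with the practitioner's own decision solver.

```python
def train_loss(beta, xs, ys):
    """Squared training loss for the one-feature model f(x) = beta * x."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys))


def min_op_cost(beta, x_new, policies):
    """min over pi of a hypothetical OpCost: staffing pi units costs pi, and
    each unit of predicted demand left unserved costs 3."""
    demand = sum(beta * x for x in x_new)
    return min(pi + 3.0 * max(0.0, demand - pi) for pi in policies)


def simultaneous(xs, ys, x_new, policies, C1, C2, bias="optimistic"):
    """Step 1 by grid search over beta, then the resulting optimal cost."""
    sign = 1.0 if bias == "optimistic" else -1.0
    grid = [k / 1000.0 for k in range(4001)]  # candidate beta values in [0, 4]

    def objective(beta):
        return (train_loss(beta, xs, ys) + C2 * beta * beta
                + sign * C1 * min_op_cost(beta, x_new, policies))

    beta = min(grid, key=objective)
    return beta, min_op_cost(beta, x_new, policies)


# Hypothetical noisy history and unlabeled instances.
xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
x_new = [1.0, 1.5]
policies = [k / 10.0 for k in range(201)]

beta_seq, cost_seq = simultaneous(xs, ys, x_new, policies, C1=0.0, C2=0.0)
beta_opt, cost_opt = simultaneous(xs, ys, x_new, policies, C1=1.0, C2=0.0)
beta_pes, cost_pes = simultaneous(xs, ys, x_new, policies, C1=1.0, C2=0.0,
                                  bias="pessimistic")
```

On this toy data, the optimistic run yields a lower operational cost than the $C_1 = 0$ (sequential) run, and the pessimistic run a higher one, at a small price in training loss.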

The optimization problem in the simultaneous process can be computationally difficult, particularly
if the subproblem to minimize OpCost involves discrete optimization. However, if the
number of unlabeled instances is small, or if the policy decision can be broken into several smaller
subproblems, then even if the training set is large, one can solve Step 1 using different types of
mathematical programming solvers, including MINLP solvers (Bonami et al., 2008), Nelder-Mead
(Nelder and Mead, 1965) and Alternating Minimization schemes (Tulabandhula et al., 2011b). One
needs to be able to solve instances of that optimization problem in any case for Step 2 of the
sequential process. The simultaneous process is more intensive than the sequential process in that it
requires repeated solutions of that optimization problem, rather than a single solution.

The regularization term $R(f)$ can be, for example, an $\ell_1$ or $\ell_2$ regularization term to encourage
a sparse or smooth solution.

As the $C_1$ coefficient swings between large values for optimistic and pessimistic cases, the
algorithm finds the best solution (having the lowest loss with respect to the data) for each possible 
cost. Once the regularization coefficient is too large, the algorithm will sacrifice empirical error in 
favor of lower costs, and will thus obtain solutions that are not reasonable. When that happens, 
we know we have already mapped out the full range of costs for reasonable solutions. This range 
can be used for pre-allocation decisions. 
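A minimal sketch of such a sweep is given below, under strong simplifying assumptions: a one-feature least squares model, an operational cost whose inner minimum is simply the predicted total demand $\beta \sum_i \tilde{x}_i$, and a signed coefficient (positive for optimistic, negative for pessimistic). The tolerance `tol` that defines a "reasonable" training loss is a hypothetical choice.

```python
def sweep(xs, ys, x_new, C1_values, tol=0.05):
    """Sweep a signed C1 and report the range of costs among reasonable models."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    s = sum(x_new)

    def loss(beta):
        return sum((y - beta * x) ** 2 for x, y in zip(xs, ys))

    base = loss(sxy / sxx)  # best achievable training loss (C1 = 0)
    rows = []
    for C1 in C1_values:
        # Closed-form minimizer of loss(beta) + C1 * beta * s.
        beta = (sxy - 0.5 * C1 * s) / sxx
        rows.append((C1, beta * s, loss(beta)))

    # Keep only models whose training loss is within tol of the best loss.
    costs = [cost for _, cost, L in rows if L <= base + tol]
    return rows, (min(costs), max(costs))


xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
x_new = [1.0, 1.5]
rows, (lowest, highest) = sweep(xs, ys, x_new, [-1.0, -0.5, 0.0, 0.5, 1.0])
```

Here the runs with $|C_1| = 1$ sacrifice too much training loss and are discarded, so the reported interval `(lowest, highest)` covers only the costs attached to reasonable models.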

It is possible for the set of feasible policies $\Pi$ to depend on recommendations $\{f(\tilde{x}_1), \ldots, f(\tilde{x}_m)\}$,
so that $\Pi = \Pi(f, \{\tilde{x}_i\}_i)$ in general. We will revisit this possibility in Section 4. It is also possible
for the optimization over $\pi \in \Pi$ to be trivial, or the optimization problem could have a closed form
solution. Our notation does accommodate this, and is more general.

One should not view the operational cost as a utility function that needs to be estimated, as in 
reinforcement learning, where we do not know the cost. Here one knows precisely what the cost will 
be under each possible outcome. Unlike in reinforcement learning, we have a complicated one-shot
decision problem at hand and have training data as well as future/unlabeled examples on which
the predictive model makes predictions.

The use of unlabeled data $\{\tilde{x}_i\}_i$ has been explored widely in the machine learning literature
under semi-supervised, transductive, and unsupervised learning. In particular, we point out that
the simultaneous process is not a semi-supervised learning method (see Chapelle et al., 2006), since
it does not use the unlabeled data to provide information about the underlying distribution. A
small unlabeled sample is not very useful for semi-supervised learning, but could be very useful for
constructing a low-cost policy. The simultaneous process also has a resemblance to transductive
learning (see Zhu, 2007), whose goal is to produce the output labels on the set of unlabeled examples;
in this case, we produce a function (namely the operational cost) applied to those output labels. The
simultaneous process, for a fixed choice of $C_1$, can also be considered as a multi-objective machine
learning method, since it involves an optimization problem having two terms with competing goals
(see Jin, 2006).



2.1 The Simultaneous Process in the Context of Structural Risk Minimization 



In the framework of statistical learning theory (e.g., Vapnik, 1998; Pollard, 1984; Anthony and
Bartlett, 1999; Zhang, 2002), prediction ability of a class of models is guaranteed when the class
has low "complexity," where complexity is defined via covering numbers, VC (Vapnik-Chervonenkis)
dimension, Rademacher complexity, Gaussian complexity, etc. Limiting the complexity of the
hypothesis space imposes a bias, and the classical image associated with the bias-variance tradeoff
is provided in Figure 1(a). The set of good models is indicated on the axis of the figure. Models
that are not good are either overfitted (explaining too much of the variance of the data, having
a high complexity), or underfitted (having too strong of a bias and a high empirical error). By
understanding complexity, we can find a model class where both the training error and the
complexity are kept low. An example of increasingly complex model classes is the set of nested classes
of polynomials, starting with constants, then linear functions, second order polynomials and so on.

Figure 1: In all three plots, the x-axis represents model classes with increasing complexity. (a)
Relationship between training error and test error as a function of model complexity, with
regions labeled "Underfitting," "Good Models" and "Overfitting." (b) A possible operational
cost as a function of model complexity, with a good, low cost solution marked. (c) Another
possible operational cost, with a good, low cost solution marked.

In predictive modeling problems, there is often no one right statistical model when dealing with
finite datasets; in fact, there may be a whole class of good models. In addition, it is possible that a
small change in the choice of predictive model could lead to a large change in the cost required to 
implement the policy recommended by the model. This occurs, for instance, when costs are based 
on objects (e.g., products) that come in discrete amounts. Figure 1(b) illustrates this possibility, by
showing that there may be a variety of costs amongst the class of good models. The simultaneous 
process can find the range of costs for the set of good models, which can be used for allocation of 
costs, as discussed in the first question in the introduction. 

Figure 1(c) illustrates the third question in the introduction, where we have a strong prior belief
that the operational cost will not be above a certain fixed amount. Accordingly, we will choose only
amongst the class of low cost models. This can significantly limit the complexity of the hypothesis
space, because the set of low-cost good models might be much smaller than the full space of good
models. Consider, for example, the cost displayed in Figure 1(c), where only models on the left
part of the plot would be considered, since they are low cost models. Because the hypothesis space
is smaller, we may be able to produce a tighter bound on the complexity of the hypothesis space,
thereby obtaining a better prediction guarantee for the simultaneous process than for the sequential
process. In Section 5 we develop results of this type. These results indicate that in some cases, the
operational cost can be an important quantity for generalization.

3. Conceptual Demonstrations 

We provide four examples. In the first, we estimate manpower requirements for a scheduling task. 
In the second, we estimate real estate prices for a purchasing decision. In the third, we estimate 
call arrival rates for a call center staffing problem. In the fourth, we estimate failure probabilities 
for manholes (access points to an underground electrical grid). The first three are small scale 
reproducible examples, designed to demonstrate new types of constraints due to operational costs. 
In the first example, the operational cost subproblem involves scheduling. In the second, it is a 
knapsack problem, and in the third, it is another multidimensional knapsack variant. In the fourth, 
it is a routing problem. In the first, second and fourth examples, the operational cost leads to a 
linear constraint, while in the third example, the cost leads to a quadratic constraint. 

Throughout this section, we will assume that we are working with linear functions $f$ of the form
$\beta^T x$, so that $\Pi(f, \{\tilde{x}_i\}_i)$ is equivalent to $\Pi(\beta, \{\tilde{x}_i\}_i)$. We will set $R(f)$ to be equal to $\|\beta\|_2^2$. We will
also use the notation $\mathcal{F}^R$ to denote the set of linear functions that satisfy an additional property:

$$\mathcal{F}^R := \{f \in \mathcal{F}^{unc} : R(f) \le C_2^*\},$$

where $C_2^*$ is a known constant greater than zero. We will use constant $C_2$ from (1), and also $C_2^*$
from the definition of $\mathcal{F}^R$, to control the extent of regularization. $C_2$ is inversely related to $C_2^*$. We
use both versions interchangeably throughout the paper.
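The inverse relationship between the penalty constant and the constraint level can be seen in a one-feature least squares sketch: as the penalty grows, the squared norm of the fitted coefficient shrinks, so the penalized solution sits inside a smaller constrained class. The data and the helper `ridge_1d` are hypothetical, and the closed form holds only for this toy case.

```python
def ridge_1d(xs, ys, C2):
    """Closed-form minimizer of sum_i (y_i - beta*x_i)^2 + C2 * beta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + C2)


xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
penalties = [0.0, 1.0, 10.0, 100.0]

# For each penalty C2, the smallest constraint level whose constrained
# class {beta : beta^2 <= level} contains the penalized solution.
implied_levels = [ridge_1d(xs, ys, C2) ** 2 for C2 in penalties]
```

The list `implied_levels` is strictly decreasing, which is the sense in which a larger penalty corresponds to a smaller constraint level in this example.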

3.1 Manpower Data and Scheduling with Precedence Constraints

We aim to schedule the starting times of medical staff, who work at 6 stations, for instance,
ultrasound, X-ray, MRI, CT scan, nuclear imaging, and blood lab. Current and incoming patients
need to go through some of these stations in a particular order. The six stations and the possible
orders are shown in Figure 2. Each station is denoted by a line. Work starts at the check-in (at
time $\pi_1$) and ends at the check-out (at time $\pi_5$). The stations are numbered 6-11, in order to avoid
confusion with the times $\pi_1$-$\pi_5$. The clinic has precedence constraints, where a station cannot be
staffed (or work with patients) until the preceding stations are likely to finish with their patients.
For instance, the check-out should not start until all the previous stations finish. Also, as shown in
Figure 2, station 11 should not start until stations 8 and 9 are complete at time $\pi_4$, and station 9
should not start until station 7 is complete at time $\pi_3$. (This is related to a similar problem called
planning with preference posed by F. Malucelli, Politecnico di Milano).

Figure 2: Staffing estimation with bias on scheduling with precedence constraints.

The operational goal is to minimize the total time of the clinic's operation, from when the
check-in happens at time $\pi_1$ until the check-out happens at time $\pi_5$. We estimate the time it takes
for each station to finish its job with the patients based on two variables: the new load of patients
for the day at the station, and the number of current patients already present. The data is available
as manpower in the R-package bestglm, using "Hour," "Load" and "Stay" columns. The training
error is chosen to be the least squares loss between the estimated time for stations to finish their
jobs ($\beta^T x_i$) and the actual times it took to finish ($y_i$). The unlabeled data are the new load and
current patients present at each station for a given period, given as $\tilde{x}_6, \ldots, \tilde{x}_{11}$. Let $\pi$ denote the
5-dimensional real vector with coordinates $\pi_1, \ldots, \pi_5$.

The operational cost is the total time $\pi_5 - \pi_1$. Step 1, with an optimistic bias, can be written
as:

$$\min_{\beta : \|\beta\|_2^2 \le C_2^*} \; \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \min_{\pi \in \Pi(\beta, \{\tilde{x}_i\}_i)} (\pi_5 - \pi_1), \qquad (4)$$

where the feasible set $\Pi(\beta, \{\tilde{x}_i\}_i)$ is defined by the following constraints:

$$\pi_a + \beta^T \tilde{x}_i \le \pi_b; \quad (a, i, b) \in \{(1, 6, 2), (1, 7, 3), (2, 8, 4), (3, 9, 4), (2, 10, 5), (4, 11, 5)\},$$
$$\pi_a \ge 0 \quad \text{for } a = 1, \ldots, 5.$$
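For fixed predicted durations, the inner minimization over event times is a longest-path (critical path) computation on the precedence graph. The sketch below assumes nonnegative predicted durations and uses hypothetical numbers; in the full problem, the duration of station $i$ would be the model's prediction for that station. The function name `min_total_time` is an illustrative choice.

```python
# Precedence triples (a, i, b): station i starts at event a and must
# finish by event b, exactly as in the constraint list above.
TRIPLES = [(1, 6, 2), (1, 7, 3), (2, 8, 4), (3, 9, 4), (2, 10, 5), (4, 11, 5)]


def min_total_time(durations):
    """Return (pi_5 - pi_1, pi) using the earliest feasible event times.

    `durations` maps station number -> predicted time (assumed nonnegative).
    """
    pi = {a: 0.0 for a in range(1, 6)}  # start from pi_a = 0, so pi_a >= 0 holds
    # Every triple has a < b, so processing triples in increasing order of b
    # visits the events in a valid topological order.
    for a, i, b in sorted(TRIPLES, key=lambda t: t[2]):
        pi[b] = max(pi[b], pi[a] + durations[i])
    return pi[5] - pi[1], pi


# Hypothetical predicted durations for stations 6-11.
durations = {6: 2.0, 7: 1.0, 8: 3.0, 9: 2.5, 10: 4.0, 11: 1.5}
total, pi = min_total_time(durations)
```

In this example the binding path is check-in, station 6, station 8, station 11, check-out, so `total` equals the sum of those three durations.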



To solve (4) given values of $C_1$ and $C_2$, we used a function-evaluation-based scheme called
Nelder-Mead (Nelder and Mead, 1965), where at every iterate of $\beta$, the subproblem for $\pi$ was solved
to optimality (using Gurobi).¹ $C_2$ was chosen heuristically based on (1) beforehand and kept fixed
for the experiment.

Figure 3: Left: Operational cost vs $C_1$. Center: Penalized training loss vs $C_1$. Right: R-squared
statistic. $C_1 = 0$ corresponds to the baseline, which is the sequential formulation.

Figure 3 shows the operational cost, training loss, and $R^2$ statistic² for various values of $C_1$. For
$C_1$ values between 0 and 0.2, the operational cost varies substantially, by ~16%. The $R^2$ values for
both training and test vary much less, by ~3.5%, where the best value happened to have the largest
value of $C_1$. For small datasets, there is generally a variation between training and test: for this
data split, there is a 3.16% difference in $R^2$ between training and test for plain least squares, and
this is similar across various splits of the training and test data. This means that for the scheduling
problem, there is a range of reasonable predictive models within about 3.5% of each other.
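For reference, a minimal sketch of the goodness-of-fit statistic follows, assuming it is the usual coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares). The function name `r_squared` and the sample numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted completion times.
score = r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

A perfect fit scores 1, and predicting the mean of the labels for every instance scores 0.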

What we learn from this, in terms of the three questions in the introduction, is that: 1) There is 
a wide range of possible costs within the range of reasonable optimistic models. 2) We have found 
a reasonable scenario, supported by data, where the cost is 16% lower than in the sequential case. 
3) If we have a prior belief that the cost will be lower, the models that are more accurate are the 
ones with lower costs, and therefore we may not want to designate the full cost suggested by the 
sequential process. We can perhaps designate up to 16% less. 

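Connection to learning theory: In the experiment, we used tradeoff parameter $C_1$ to provide
a soft constraint. Considering instead the corresponding hard constraint $\min_\pi (\pi_5 - \pi_1) \le \alpha$, the
total time must be at least the time for any of the 3 paths in Figure 2, and thus at least the average
of them:

$$\alpha \ge \min_{\pi \in \Pi(\beta, \{\tilde{x}_i\}_i)} \pi_5 - \pi_1$$
$$\ge \max\{(\tilde{x}_6 + \tilde{x}_{10})^T \beta, \; (\tilde{x}_6 + \tilde{x}_8 + \tilde{x}_{11})^T \beta, \; (\tilde{x}_7 + \tilde{x}_9 + \tilde{x}_{11})^T \beta\}$$
$$\ge z^T \beta \qquad (5)$$

where

$$z = \frac{1}{3} \left[ (\tilde{x}_6 + \tilde{x}_{10}) + (\tilde{x}_6 + \tilde{x}_8 + \tilde{x}_{11}) + (\tilde{x}_7 + \tilde{x}_9 + \tilde{x}_{11}) \right].$$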
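The last inequality in (5) says that the longest of the three path times is at least their average. The following sketch checks this numerically for hypothetical two-feature station vectors (new load, current patients) and a hypothetical coefficient vector; the helper names are illustrative.

```python
# The three check-in to check-out paths, listed by station number.
PATHS = [(6, 10), (6, 8, 11), (7, 9, 11)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(*vectors):
    """Componentwise sum of any number of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def path_times(x, beta):
    """Predicted time of each path: (sum of station vectors)^T beta."""
    return [dot(add(*(x[i] for i in path)), beta) for path in PATHS]


def z_vector(x):
    """Average of the three path sums, i.e. the vector z in (5)."""
    sums = [add(*(x[i] for i in path)) for path in PATHS]
    return [c / 3.0 for c in add(*sums)]


# Hypothetical unlabeled feature vectors for stations 6-11.
x = {6: [2.0, 4.0], 7: [1.0, 2.0], 8: [4.0, 4.0],
     9: [3.0, 2.0], 10: [6.0, 4.0], 11: [2.0, 2.0]}
beta = [0.5, 0.25]

times = path_times(x, beta)
lower_bound = dot(z_vector(x), beta)
```

Here `max(times)` is the minimal total time for this coefficient vector, and `lower_bound` is the linear quantity that the hard constraint caps from above.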



1. Gurobi Optimizer v3.0, Gurobi Optimization, Inc. 2010.

2. If $\hat{y}_i$ are the predicted labels and $\bar{y}$ is the mean of $\{y_1, \ldots, y_n\}$, then the value of the $R^2$ statistic is defined as $1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$.

The main result in Section 5, Theorem 6, is a learning theoretic guarantee in the presence of this
kind of arbitrary linear constraint, $z^T \beta \le \alpha$.

3.2 Housing Prices and the Knapsack Problem 

A developer will purchase 3 properties amongst the 6 that are currently for sale. She will remodel
them at fixed costs and sell them for a profit. The fixed costs for the 6 properties are denoted
$\{c_i\}_{i=1}^6$. She estimates the value of each property from data regarding historical sales, in this case,
from the Boston Housing data set (Frank and Asuncion, 2010), which has 13 features. Let policy
$\pi \in \{0, 1\}^6$ be the 6-dimensional binary vector that indicates the properties she purchases. The
training loss is chosen to be the sum of squares error between the estimated prices $\beta^T x_i$ and the true
house prices $y_i$ for historical sales. The cost (in this case, profit) is the sum of the three property
values plus the costs for repair work. A pessimistic bias on profit is chosen to motivate a min-max
formulation. The resulting (mixed-integer) program for Step 1 of the simultaneous process is:



mm 

/3e{/3:/3e]RiM|/; 



Xi 



+Ci 



rnax^^ ^^(/J-^Xj + Ci)iTi subject to vTj < 3 



1=1 



1=1 



(6) 



Notice that the second term above is a 1-dimensional {0, 1} knapsack instance. Since the set of 
policies $\Pi$ does not depend on $\beta$, we can rewrite (6) in a cleaner way that was not possible directly
with Q: 



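$$
\begin{aligned}
\min_{\beta} \max_{\pi} \quad & \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \sum_{i=1}^6 (\beta^T \tilde{x}_i + c_i)\pi_i \\
\text{subject to} \quad & \beta \in \{\beta : \beta \in \mathbb{R}^{13}, \|\beta\|_2^2 \le C_2^*\}, \\
& \pi \in \Big\{\pi : \pi \in \{0,1\}^6, \ \sum_{i=1}^6 \pi_i \le 3\Big\}.
\end{aligned}
\tag{7}
$$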



To solve (7) with user-defined parameters $C_1$ and $C_2$, we use fminimax, available through
Matlab's Optimization toolbox.$^3$

For the training and unlabeled set we chose, there is a change in policy above and below
$C_1 = 0.05$, where different properties are purchased. Figure 4 shows the operational profit, the
training loss, and $R^2$ values for a range of $C_1$. The training loss and $R^2$ values change by less than
~3.5%, whereas the operational profit changes about 6.5%. The pessimistic bias shows that even
if the developer chose the best response policy to the prices, she might make on the order of 6.5%
less if she is unlucky. Also, we can now produce a realistic model where she would make 6.5% less.
We can use this model to help her understand the uncertainty involved in her investment.

Figure 4: Left: Operational profit vs $C_1$. Center: Penalized training loss vs $C_1$. Right: R-squared
statistic. $C_1 = 0$ corresponds to the baseline, which is the sequential formulation.

3. ver 5.1, Matlab R2010b, Mathworks, Inc.

Before moving to the next application of the proposed framework, we provide a bound analogous
to that of (5). Let us replace the soft constraint represented by term 2 of (6) with a hard constraint
and then obtain a lower bound:



$$
\alpha \ \ge \max_{\pi \in \{0,1\}^6,\ \sum_{i=1}^6 \pi_i \le 3} \ \sum_{i=1}^6 (\beta^T \tilde{x}_i)\pi_i \ \ge \ \sum_{i=1}^6 (\beta^T \tilde{x}_i)\pi'_i,
\tag{8}
$$

where $\pi'$ is some feasible solution of the linear programming relaxation of this problem which also
gives a lower objective value. For instance, picking $\pi'_i = 0.5$ for $i = 1, \ldots, 6$ is a valid lower bound,
giving us a looser constraint. The constraint can be rewritten:

$$
\beta^T \left( \frac{1}{2} \sum_{i=1}^6 \tilde{x}_i \right) \le \alpha.
$$

This is again a linear constraint on the function class parametrized by $\beta$, that we can use for the
analysis in Section 5.

Note that if all six properties were being purchased by the developer instead of three, the 
knapsack problem would have a trivial solution and the regularization term would be explicit 
(rather than implicit). 
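
As a minimal sketch of the inner knapsack subproblem in this example, the snippet below enumerates all ways of buying at most three of six properties and then checks numerically that the relaxed choice $\pi'_i = 0.5$ gives a lower bound on the best total value. The predicted values and cost terms are made-up numbers, the helper name `best_purchase` is hypothetical, and brute-force enumeration is only a stand-in for a real solver.

```python
from itertools import combinations

def best_purchase(values, cost_terms, max_items=3):
    """Enumerate every way to buy at most max_items properties.

    Returns (best profit, tuple of chosen indices)."""
    best_profit, best_set = 0.0, ()  # buying nothing is always feasible
    for size in range(1, max_items + 1):
        for subset in combinations(range(len(values)), size):
            profit = sum(values[i] + cost_terms[i] for i in subset)
            if profit > best_profit:
                best_profit, best_set = profit, subset
    return best_profit, best_set

# Hypothetical predicted prices beta^T x_i and cost terms c_i (illustrative only).
predicted_values = [30.0, 22.0, 41.0, 18.0, 35.0, 27.0]
cost_terms = [-5.0, -3.0, -12.0, -2.0, -6.0, -4.0]
profit, chosen = best_purchase(predicted_values, cost_terms)

# Lower bound with the values alone: pi'_i = 0.5 is feasible for the LP
# relaxation, so half the total value cannot exceed the best three-item value.
max_value, _ = best_purchase(predicted_values, [0.0] * 6)
relaxed_bound = 0.5 * sum(predicted_values)
print(chosen, profit, max_value, relaxed_bound)
```

With only six properties the enumeration is tiny; for larger instances one would hand the same objective to an integer programming solver.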



3.3 A Call Center's Workload Estimation and Staff Scheduling 

A call center's management wants to come up with the per-half-hour schedule for the staff for a
given day from 10am to 10pm. The staff on duty should be enough to meet the demand
based on call arrival estimates $N(i)$, $i = 1, \ldots, 24$. The staff required will depend linearly on the
demand per half-hour. The demand per half-hour in turn will be computed based on the Erlang
C model (Aldor-Noiman et al., 2009), which is also known as the square-root staffing rule. This
particular model relates the demand $D(i)$ to the call arrival rate $N(i)$ in the following manner:
$D(i) \propto N(i) + c\sqrt{N(i)}$, where $c$ determines where on the QED (Quality Efficiency Driven) curve
the center wants to operate. We make the simplifying assumptions that the service time for
each customer is constant, and that the coefficient $c$ is 0.

If we know the call arrival rate $N(i)$, we can calculate the staffing requirements during each
half hour. If we do not know the call arrival rate, we can estimate it from past data, and make
optimistic or pessimistic staffing decisions.

Figure 5: The three shifts for the call center. The cells represent half-hour periods, and there are
24 periods per work day. Work starts at 10am and ends at 10pm.



There are additional staffing constraints as shown in Figure 5; namely, there are three sets of
employees who work at the center such that: the first set can work only from 10am-3pm, the second
can work from 1:30pm-6:30pm, and the third set works from 5pm-10pm. The operational cost is
the total number of employees hired to work that day (times a constant, which is the amount each 
person is paid). The objective of the management is to reduce the number of staff on duty but at 
the same time maintain a certain quality and efficiency. 



The call arrivals are modeled as a Poisson process (Aldor-Noiman et al., 2009). What previous
studies (Brown et al., 2001) have discovered about this estimation problem is that the square root
of the call arrival rate tends to behave as a linear function of several features, including: day of the
week, time of the day, whether it is a holiday/irregular day, and whether it is close to the end of
the billing cycle.

Data for call arrivals and features were collected over a period of 10 months from mid-February
2004 to the end of December 2004, using the same dataset described by Aldor-Noiman et al. (2009).
After converting categorical variables into binary encodings (e.g., each of the 7 weekdays into 6
binary features) the number of features is 36, and we randomly split the data into a training set
and test set (2764 instances for training; another 3308 for test).



Figure 6: Left: Operational cost vs $C_1$. Center: Penalized training loss vs $C_1$. Right: R-squared
statistic. $C_1 = 0$ corresponds to the baseline, which is the sequential formulation.

We now formalize the optimization problem for the simultaneous process. Let policy $\pi \in \mathbb{Z}_+^3$
be a size three vector indicating the number of employees for each of the three shifts. The training
loss is the sum of squares error between the estimated square root of the arrival rate $\beta^T x_i$ and
the actual square root of the arrival rate $y_i := \sqrt{N(i)}$. The cost is proportional to the total
number of employees signed up to work, $\sum_i \pi_i$. An optimistic bias on cost is chosen, so that the
(mixed-integer) program for Step 1 is:

$$
\min_{\beta : \|\beta\|_2^2 \le C_2^*} \ \sum_{i=1}^n (y_i - \beta^T x_i)^2 + C_1 \left[ \min_{\pi} \sum_{i=1}^3 \pi_i \ \text{ subject to } a_i^T \pi \ge (\beta^T \tilde{x}_i)^2 \text{ for } i = 1, \ldots, 24, \ \pi \in \mathbb{Z}_+^3 \right],
\tag{9}
$$



where Figure 5 illustrates the matrix $A$ with the shaded cells containing entry 1 and 0 elsewhere.
The notation $a_i$ indicates the $i$-th row of $A$:

$$
a_i(j) = \begin{cases} 1 & \text{if staff } j \text{ can work in half-hour period } i \\ 0 & \text{otherwise.} \end{cases}
\tag{10}
$$



To solve (9), we first relax the $\ell_2$-norm constraint on $\beta$ by adding another term to the function
evaluation, namely $C_2\|\beta\|_2^2$. This way, we can use a function-evaluation based scheme that works
for unconstrained optimization problems. As in the manpower scheduling example, we used an
implementation of the Nelder-Mead algorithm, where at each step, Gurobi was used to solve the
mixed-integer subproblem for finding the policy.
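
To make the policy subproblem concrete, here is a minimal sketch of the staffing integer program for a fixed demand vector. It assumes the three shift windows described above map to half-hour periods 0-9, 7-16 and 14-23, uses a made-up demand profile in place of $(\beta^T \tilde{x}_i)^2$, and replaces the mixed-integer solver with plain enumeration; the name `min_staff` is hypothetical.

```python
from itertools import product
from math import ceil

NUM_PERIODS = 24
# Shift windows as 0-based half-hour indices (period 0 starts at 10am):
# 10am-3pm, 1:30pm-6:30pm and 5pm-10pm, each covering 10 periods.
SHIFTS = [range(0, 10), range(7, 17), range(14, 24)]

# A[i][j] = 1 if shift j covers period i, else 0.
A = [[1 if i in shift else 0 for shift in SHIFTS] for i in range(NUM_PERIODS)]

def min_staff(demand):
    """Smallest total headcount pi (nonnegative integers) with A pi >= demand."""
    upper = ceil(max(demand))  # no shift ever needs more staff than the peak demand
    best = None
    for pi in product(range(upper + 1), repeat=len(SHIFTS)):
        covered = all(
            sum(A[i][j] * pi[j] for j in range(len(SHIFTS))) >= demand[i]
            for i in range(NUM_PERIODS)
        )
        if covered and (best is None or sum(pi) < sum(best)):
            best = pi
    return best

# Hypothetical demand per half-hour period (illustrative numbers only).
demand = [2] * 7 + [5] * 3 + [3] * 4 + [6] * 3 + [4] * 7
staffing = min_staff(demand)
print(staffing, sum(staffing))
```

In an outer loop, a derivative-free search over $\beta$ would call such a routine at each function evaluation, which is the role the mixed-integer solver plays in the procedure described above.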

Figure 6 shows the operational cost, the training loss, and $R^2$ values for a range of $C_1$. The
training loss and $R^2$ values change only ~1.6% and ~3.9% respectively, whereas the operational
cost changes about 9.2%. The optimistic bias shows that the management might incur operational
costs on the order of 9% less if they are lucky. Further, the simultaneous process produces a
reasonable model where costs are about 9% less. If the management team believes they will be
reasonably lucky, they can justify designating substantially less than the amount suggested by the
traditional sequential process.

Let us now investigate the structure of the operational cost regularization term we have in (9).
For convenience, let us stack the quantities $(\beta^T \tilde{x}_i)^2$ as a vector $b \in \mathbb{R}^{24}$. Also let boldface symbol
$\mathbf{1}$ represent a vector of all ones. If we replace the soft constraint represented by term 2 with a hard
constraint with an upper bound $\alpha$, we get:

$$
\alpha \ \ge \min_{\pi \in \mathbb{Z}_+^3;\ A\pi \ge b} \sum_{i=1}^3 \pi_i
\ \overset{(\dagger)}{\ge} \min_{\pi \in \mathbb{R}_+^3;\ A\pi \ge b} \sum_{i=1}^3 \pi_i
\ \overset{(\ddagger)}{=} \max_{w \in \mathbb{R}_+^{24};\ A^T w \le \mathbf{1}} \sum_{i=1}^{24} w_i (\beta^T \tilde{x}_i)^2
\ \overset{(*)}{\ge} \sum_{i=1}^{24} \frac{1}{10} (\beta^T \tilde{x}_i)^2.
$$

Here $\alpha$ is related to the choice of $C_1$ and is fixed. $(\dagger)$ represents an LP relaxation of the integer
program, with $\pi$ now belonging to the positive orthant rather than the Cartesian product of sets of
positive integers. $(\ddagger)$ is due to LP strong duality and $(*)$ is by choosing an appropriate feasible dual
variable. Specifically, we pick $w_i = \frac{1}{10}$ for $i = 1, \ldots, 24$, which is feasible because staff cannot work
more than 10 half hour shifts (or 5 hours). With the three inequalities, we now have a constraint
on $\beta$ of the form:

$$
\sum_{i=1}^{24} (\beta^T \tilde{x}_i)^2 \le 10\alpha.
$$

This is a quadratic form in $\beta$ and gives an ellipsoidal feasible set. We already had a simple ellipsoidal
feasibility constraint while defining the minimization problem of (9) of the form $\|\beta\|_2^2 \le C_2^*$. Thus,
we can see that our effective hypothesis set (the set of linear functionals satisfying these constraints)
has become smaller. This in turn affects generalization. We are investigating generalization bounds
for this type of hypothesis set in separate ongoing work.
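
The dual-feasibility step can be checked numerically. The sketch below assumes the three shifts cover periods 0-9, 7-16 and 14-23 (ten half-hour periods each), verifies that the constant dual vector with entries 1/10 satisfies $A^T w \le \mathbf{1}$, and evaluates the resulting lower bound for a made-up demand vector and one feasible staffing plan. All numbers are illustrative.

```python
from fractions import Fraction

NUM_PERIODS = 24
SHIFTS = [range(0, 10), range(7, 17), range(14, 24)]  # assumed shift windows
A = [[1 if i in shift else 0 for shift in SHIFTS] for i in range(NUM_PERIODS)]

# Candidate dual variable: w_i = 1/10 for every period (exact arithmetic).
w = [Fraction(1, 10)] * NUM_PERIODS

# Dual feasibility: each column of A has ten ones, so (A^T w)_j = 10 * (1/10) = 1.
column_loads = [sum(A[i][j] * w[i] for i in range(NUM_PERIODS)) for j in range(len(SHIFTS))]

# Hypothetical demand vector b and a staffing plan that covers it.
b = [2] * 7 + [5] * 3 + [3] * 4 + [6] * 3 + [4] * 7
pi = (2, 3, 4)
is_feasible = all(sum(A[i][j] * pi[j] for j in range(3)) >= b[i] for i in range(NUM_PERIODS))

# Weak duality: any feasible headcount is at least w^T b = (1/10) * sum(b).
lower_bound = sum(wi * bi for wi, bi in zip(w, b))
print(column_loads, lower_bound, sum(pi))
```

Exact fractions are used only to avoid floating-point noise when comparing the column sums with 1.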



3.4 The Machine Learning and Traveling Repairman Problem (ML&TRP) of
Tulabandhula et al. (2011a,b)

Recently, power companies have been investing in intelligent "proactive" maintenance for the power 
grid, in order to enhance public safety and reliability of electrical service. For instance, New
York City has implemented new inspection and repair programs for manholes, where a manhole 
is an access point to the underground electrical system. Electrical grids can be extremely large 
(there are on the order of 23,000-52,000 manholes in each borough of NYC), and parts of the 
underground distribution network in many cities can be as old as 130 years, dating from the time 
of Thomas Edison. Because of the difficulties in collecting and analyzing historical electrical grid 
data, electrical grid repair and maintenance has been performed reactively (fix it only when it 
breaks), until recently (Urbina, 2004). These new proactive maintenance programs open the door 
for machine learning to assist with smart grid maintenance. 

Machine learning models have started to be used for proactive maintenance in NYC, where
supervised ranking algorithms are used to rank the manholes in order of predicted susceptibility to
failure (fires, explosions, smoke) so that the most vulnerable manholes can be prioritized (Rudin
et al., 2010, 2011b,a). The machine learning algorithms make reasonably accurate predictions
of manhole vulnerability; however, they do not (nor would they, using any other prediction-only
technique) take the cost of repairs into account when making the ranked lists. They do not know
that it is unreasonable, for example, if a repair crew has to travel across the city and back again for
each manhole inspection, losing important time in the process. The power company must solve an
optimization problem to determine the best repair route, based on the machine learning model's
output. We might wish to find a policy that is not only supported by the historical power grid data
(that ranks more vulnerable manholes above less vulnerable ones), but also would give a better
route for the repair crew. An algorithm that could find such a route would lead to an improvement
in repair operations on NYC's power grid, other power grids across the world, and improvements
in many different kinds of routing operations (delivery trucks, trains, airplanes).

The simultaneous process could be used to solve this problem, where the operational cost is
the price to route the repair crew along a graph, and the probabilities of failure at each node in
the graph must be estimated. We call this "the machine learning and traveling repairman
problem" (ML&TRP), and in our ongoing work (Tulabandhula et al., 2011a), we have developed
several formulations for the ML&TRP. We demonstrated, using manholes from the Bronx region
of NYC, that it is possible to obtain a much more practical route using the ML&TRP, by taking
the cost of the route optimistically into account in the machine learning model. We showed also
that from the routing problem, we can obtain a linear constraint on the hypothesis space, in order
to apply the generalization analysis of Section 5.

4. Connections to Robust Optimization 

The goal of robust optimization (RO) is to provide the best possible policy that is acceptable under 
a wide range of situations.$^4$ This is different from the simultaneous process, which aims to find the
best policies and costs for specific situations. Note that it is not always desirable to have a policy 
that is robust to a wide range of situations; this is a question of whether to respond to every situation 
simultaneously or whether to understand the single worst situation that could reasonably occur 
(which is what the pessimistic simultaneous formulation handles). In general, robust optimization 
can be overly pessimistic, requiring us to allocate enough to handle all reasonable situations; it can 
be substantially more pessimistic than the pessimistic simultaneous process. 

In robust optimization, if there are several real-valued parameters involved in the optimization
problem, we might declare a reasonable range, called the "uncertainty set," for each parameter (e.g.,
$a_1 \in [9, 10]$, $a_2 \in [1, 2]$). Using techniques of RO, we would minimize the largest possible operational
cost that could arise from parameter settings in these ranges. Estimation is not usually involved
in the study of robust optimization (with some exceptions, see Ben-Tal et al., 2009b, who consider
support vector machines). On the other hand, one could choose the uncertainty set according to a
statistical model, which is how we will build a connection to RO. Here, we choose the uncertainty
set to be the class of models that fit the data to within $\epsilon$, according to some fitting criteria.

The major goals of the field of RO include algorithms, geometry, and tractability in finding the 
best policy, whereas our work is not concerned with finding a robust policy, but we are concerned 
with estimation, taking the policy into account. Tractability for us is not always a main concern as 
we need to be able to solve the optimization problem, even to use the sequential process. Using even 
a small optimization problem as the operational cost might have a large impact on the model and 
decision. If the unlabeled set is not too large, or if the policy optimization problem can be broken 
into smaller subproblems, there is no problem with tractability. An example where the policy 
optimization might be broken into smaller subproblems is when the policy involves routing several 
different vehicles, where each vehicle must visit part of the unlabeled set; in that case there is a 
small subproblem for each vehicle. On the other hand, even though the goals of the simultaneous 
process and RO are entirely different, there is a strong connection with respect to the formulations 
for the simultaneous process and RO, and a class of problems for which they are equivalent. We 
will explore this connection in this section. 



4. http://en.wikipedia.org/wiki/Robust_optimization 

There are other methods that consider uncertainty in optimization, though not via the lens of
estimation and learning. In the simplest case, one can perform both local and global sensitivity
analysis for linear programs to ascertain uncertainty in the optimal solution and objective, but these
techniques generally only handle simple forms of uncertainty (Bertsimas and Tsitsiklis, 1997). Our
work is also related to stochastic programming, where the goal is to find a policy that is robust to
almost all of the possible circumstances (rather than all of them), where there are random variables
governing the parameters of the problem, with known distributions (Birge and Louveaux, 1997).
Again, our goal is not to find a policy that is necessarily robust to (almost all of) the worst cases,
and estimation is again not the primary concern for stochastic programming; rather, it is how to
take known randomness into account when determining the policy.

4.1 Equivalence Between RO and the Simultaneous Process in Some Cases 

In this subsection we will formally introduce RO. In order to connect RO to estimation, we will
define the uncertainty set for RO, denoted $\mathcal{F}_{good}$, to be models for which the average loss on the
sample is within $\epsilon$ of the lowest possible. Then we will present the equivalence relationship between
RO and the simultaneous process, using a minimax theorem.

In Section 2, we had introduced the notation $\{(x_i, y_i)\}_i$ and $\{\tilde{x}_i\}_i$ for labeled and unlabeled
data respectively. We had also introduced the class $\mathcal{F}^{unc}$ in which we were searching for a function
$f^*$ by minimizing an objective of the form (1). The uncertainty set $\mathcal{F}_{good}$ will turn out to be a
subset of $\mathcal{F}^{unc}$ that depends on $\{(x_i, y_i)\}_i$ and $f^*$ but not on $\{\tilde{x}_i\}_i$.

We start with plain (non-robust) optimization, using a general version of the vanilla sequential
process. Let $f$ denote an element of the set $\mathcal{F}_{good}$, where $f$ is pre-determined, known and fixed.
Let the optimization problem for the policy decision $\pi$ be defined by:

$$
\min_{\pi \in \Pi(f; \{\tilde{x}_i\}_i)} \text{OpCost}(\pi, f; \{\tilde{x}_i\}_i), \qquad \text{(Base problem)}
\tag{11}
$$

where $\Pi(f; \{\tilde{x}_i\}_i)$ is the feasible set for the optimization problem. Note that this is a more general
version of the sequential process than in Section 2, since we have allowed the constraint set $\Pi$ to be
a function of both $f$ and $\{\tilde{x}_i\}_i$, whereas in (2) and (3), only the objective and not the constraint
set can depend on $f$ and $\{\tilde{x}_i\}_i$. Allowing this more general version of $\Pi$ will allow us to relate
(11) to RO more clearly, and will help us to specify the additional assumptions we need in order
to show the equivalence relationship. Specifically, in Section 2, OpCost depends on $(f, \{\tilde{x}_i\}_i)$ but
$\Pi$ does not; whereas in RO, generally $\Pi$ depends on $(f, \{\tilde{x}_i\}_i)$ but OpCost does not. The fact that OpCost
does not need to depend on $f$ and $\{\tilde{x}_i\}_i$ is not a serious issue, since we can generally remove their
dependence through auxiliary variables (Ben-Tal et al., 2009b). For instance, if the problem is
a minimization of the form (11), we can use an auxiliary variable, say $t$, to obtain an equivalent
problem:

$$
\begin{aligned}
\min_{\pi, t} \quad & t \qquad \text{(Base problem reformulated)} \\
\text{such that} \quad & \pi \in \Pi(f; \{\tilde{x}_i\}_i), \\
& \text{OpCost}(\pi, f; \{\tilde{x}_i\}_i) \le t,
\end{aligned}
$$

where the dependence on $(f, \{\tilde{x}_i\}_i)$ is present only in the (new) feasible set. Since we had assumed
$f$ to be fixed, this is a deterministic optimization problem (convex, mixed-integer, nonlinear, etc.).

Now, consider the case when $f$ is not known exactly but only known to lie in the uncertainty
set $\mathcal{F}_{good}$. The robust counterpart to (11) can then be written as:

$$
\min_{\pi \in \bigcap_{g \in \mathcal{F}_{good}} \Pi(g; \{\tilde{x}_i\}_i)} \ \max_{f \in \mathcal{F}_{good}} \text{OpCost}(\pi, f; \{\tilde{x}_i\}_i) \qquad \text{(Robust counterpart)}
\tag{12}
$$

where we obtain a "robustly feasible solution" that is guaranteed to remain feasible for all values
of $f \in \mathcal{F}_{good}$. In general, (12) is much harder to solve than (11) and is a topic of much interest
in the robust optimization community (e.g., see Bertsimas et al., 2011). As we discussed earlier,
there is no focus in (12) on estimation, but it is possible to embed an estimation problem within
the description of the set $\mathcal{F}_{good}$, which we now define formally.

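In Section 2, $\mathcal{F}^R$ (a subset of $\mathcal{F}^{unc}$) was defined as the set of linear functionals with the property
that $R(f) \le C_2^*$. That is,

$$
\mathcal{F}^R = \{f : f \in \mathcal{F}^{unc}, \ R(f) \le C_2^*\}.
$$

We define $\mathcal{F}_{good}$ as a subset of $\mathcal{F}^R$ by adding an additional property:

$$
\mathcal{F}_{good} = \Big\{ f : f \in \mathcal{F}^R, \ \sum_{i=1}^n l(f(x_i), y_i) \le \sum_{i=1}^n l(f^*(x_i), y_i) + \epsilon \Big\},
\tag{13}
$$

for some fixed positive real $\epsilon$. In (13), again $f^*$ is a solution that minimizes the objective in (1) over
$\mathcal{F}^{unc}$. The right hand side of the inequality in (13) is thus constant, and we will henceforth denote
it with a single quantity $C_1^*$. Substituting this definition of $\mathcal{F}_{good}$ in (12), and further making an
important assumption (denoted A1) that $\Pi$ is not a function of $(f, \{\tilde{x}_i\}_i)$, we get the following
optimization problem:

$$
\min_{\pi \in \Pi} \ \max_{\{f \in \mathcal{F}^R : \sum_{i=1}^n l(f(x_i), y_i) \le C_1^*\}} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \qquad \text{(Robust counterpart with assumptions)}
\tag{14}
$$

where $C_1^*$ now controls the amount of the uncertainty via the set $\mathcal{F}_{good}$.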

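Before we state the equivalence relationship, we restate the formulations for optimistic and
pessimistic biases on operational cost in the simultaneous process from (2) and (3):

$$
\begin{aligned}
&\min_{f \in \mathcal{F}^{unc}} \ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) + C_1 \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) && \text{(Simultaneous optimistic)} \\
&\min_{f \in \mathcal{F}^{unc}} \ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) && \text{(Simultaneous pessimistic)}
\end{aligned}
\tag{15}
$$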
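
To illustrate how the sign of the operational cost term shifts the fitted model, the following minimal sketch solves a one-dimensional version of the optimistic and pessimistic problems by grid search. It is a toy under stated assumptions: the model is $f(x) = \beta x$ with squared loss, the regularization term is replaced by simply restricting $\beta$ to a grid on $[0, 2]$, and each policy is a hypothetical (fixed cost, per-unit rate) pair whose cost is linear in the prediction. The function name `simultaneous_fit` and all numbers are made up.

```python
def simultaneous_fit(xs, ys, x_new, policies, signed_c1, grid):
    """Grid-search the scalar beta minimizing loss + signed_c1 * min_pi OpCost.

    signed_c1 > 0 gives the optimistic bias, signed_c1 < 0 the pessimistic
    bias, and signed_c1 = 0 recovers the plain least-squares fit."""
    def objective(beta):
        loss = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
        prediction = beta * x_new
        # Hypothetical OpCost: fixed cost plus a rate times the prediction.
        cheapest = min(fixed + rate * prediction for fixed, rate in policies)
        return loss + signed_c1 * cheapest
    return min(grid, key=objective)

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]  # toy data fit exactly by beta = 1
policies = [(0.0, 3.0), (4.0, 1.0)]        # made-up (fixed, rate) pairs
grid = [k / 100 for k in range(201)]       # candidate beta values in [0, 2]

sequential = simultaneous_fit(xs, ys, 1.0, policies, 0.0, grid)
optimistic = simultaneous_fit(xs, ys, 1.0, policies, 2.0, grid)
pessimistic = simultaneous_fit(xs, ys, 1.0, policies, -2.0, grid)
print(sequential, optimistic, pessimistic)
```

In this toy, the optimistic fit moves $\beta$ below the least-squares value (lower predicted cost) and the pessimistic fit moves it above, each at the price of a small increase in training loss.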



Apart from the assumption A1 on the decision set $\Pi$ that we made in (14), we will also assume
that $\mathcal{F}_{good}$ defined in (13) is convex; this will be assumption A2. If we also assume that the
objective OpCost satisfies some nice properties (A3), and that uncertainty is characterized via the
set $\mathcal{F}_{good}$, then we can show that the two problems, namely (15) and (14), are equivalent. Let $\Leftrightarrow$
denote equivalence between two problems, meaning that a solution to one side translates into the
solution of the other side for some parameter values $(C_1, C_1^*, C_2, C_2^*)$.

Proposition 1 Let $\Pi(f; \{\tilde{x}_i\}_i) = \Pi$ be compact, convex, and independent of parameters $f$ and
$\{\tilde{x}_i\}_i$ (assumption A1). Let $\{f \in \mathcal{F}^R : \sum_{i=1}^n l(f(x_i), y_i) \le C_1^*\}$ be convex (assumption A2). Let
the cost (to be minimized) OpCost$(\pi, f, \{\tilde{x}_i\}_i)$ be concave continuous in $f$ and convex continuous
in $\pi$ (assumption A3). Then, the robust optimization problem (14) is equivalent to the pessimistic
bias optimization problem (15). That is,

$$
\min_{\pi \in \Pi} \ \max_{\{f \in \mathcal{F}^R : \sum_{i=1}^n l(f(x_i), y_i) \le C_1^*\}} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i)
\ \Leftrightarrow \
\min_{f \in \mathcal{F}^{unc}} \ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i).
$$



Remark 2 That the equivalence applies to linear programs (LPs) is clear because the objective is
linear and the feasible set is generally a polyhedron, and is thus convex. For integer programs,
the objective OpCost satisfies continuity, but the feasible set is typically not convex, and hence,
the result does not generally apply to integer programs. In other words, the requirement that the
constraint set $\Pi$ be convex excludes integer programs.

To prove Proposition 1, we restate a well-known generalization of von Neumann's minimax
theorem and some related definitions.

Definition 3 A linear topological space (also called a topological vector space) is a vector space
over a topological field (typically, the real numbers with their standard topology) with a topology
such that vector addition and scalar multiplication are continuous functions. For example, any
normed vector space is a linear topological space. A function $h$ is upper semicontinuous at a point
$p_0$ if for every $\epsilon > 0$ there exists a neighborhood $U$ of $p_0$ such that $h(p) < h(p_0) + \epsilon$ for all
$p \in U$. A function $h$ defined over a convex set is quasi-concave if for all $p, q$ and $\lambda \in [0, 1]$ we
have $h(\lambda p + (1 - \lambda) q) \ge \min(h(p), h(q))$. Similar definitions follow for lower semicontinuity and
quasi-convexity.

Theorem 4 (Sion's minimax theorem, Sion, 1958) Let $\Pi$ be a compact convex subset of a linear
topological space and $\Xi$ be a convex subset of a linear topological space. Let $G(\pi, \xi)$ be a real function
on $\Pi \times \Xi$ such that

(i) $G(\pi, \cdot)$ is upper semicontinuous and quasi-concave on $\Xi$ for each $\pi \in \Pi$;

(ii) $G(\cdot, \xi)$ is lower semicontinuous and quasi-convex on $\Pi$ for each $\xi \in \Xi$.

Then

$$
\min_{\pi \in \Pi} \sup_{\xi \in \Xi} G(\pi, \xi) = \sup_{\xi \in \Xi} \min_{\pi \in \Pi} G(\pi, \xi).
$$
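
A small numerical illustration, not a proof: for the bilinear function $G(\pi, \xi) = \pi \xi$, which is linear (hence quasi-convex and quasi-concave) in each argument, the min-max and max-min values agree when both variables range over a discretization of the convex interval $[-1, 1]$, but a gap appears once $\pi$ is restricted to the non-convex set $\{-1, 1\}$. The grid and the helper name `minimax_values` are assumptions made for this sketch.

```python
def minimax_values(policies, scenarios, cost):
    """Return (min over policies of max over scenarios, max over scenarios of min over policies)."""
    min_max = min(max(cost(p, s) for s in scenarios) for p in policies)
    max_min = max(min(cost(p, s) for p in policies) for s in scenarios)
    return min_max, max_min

def bilinear(p, s):
    # Linear in each argument, so the convexity/concavity conditions hold.
    return p * s

grid = [k / 10 for k in range(-10, 11)]  # discretization of [-1, 1]

# Convex policy set: the two values coincide (both are 0).
convex_case = minimax_values(grid, grid, bilinear)

# Non-convex (integer-like) policy set {-1, 1}: min-max is 1 but max-min is 0.
integer_case = minimax_values([-1.0, 1.0], grid, bilinear)
print(convex_case, integer_case)
```

The second case mirrors the remark that integer programs fall outside the scope of the equivalence, since their feasible sets are not convex.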



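We can now proceed to the proof of Proposition 1.

Proof (Of Proposition 1) We start from the left hand side of the equivalence we want to prove:

$$
\begin{aligned}
& \min_{\pi \in \Pi} \ \max_{\{f \in \mathcal{F}^R : \sum_{i=1}^n l(f(x_i), y_i) \le C_1^*\}} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \\
\overset{(a)}{\Leftrightarrow} \ & \max_{\{f \in \mathcal{F}^R : \sum_{i=1}^n l(f(x_i), y_i) \le C_1^*\}} \ \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \\
\overset{(b)}{\Leftrightarrow} \ & \max_{f \in \mathcal{F}^{unc}} \ -\frac{1}{C_1}\Big( \sum_{i=1}^n l(f(x_i), y_i) - C_1^* \Big) - \frac{C_2}{C_1}\big( R(f) - C_2^* \big) + \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \\
\overset{(c)}{\Leftrightarrow} \ & \min_{f \in \mathcal{F}^{unc}} \ \sum_{i=1}^n l(f(x_i), y_i) + C_2 R(f) - C_1 \min_{\pi \in \Pi} \text{OpCost}(\pi, f, \{\tilde{x}_i\}_i)
\end{aligned}
$$

which is the right hand side of the logical equivalence in the statement of the proposition. In step (a)
we applied Sion's minimax theorem (Theorem 4), which is satisfied because of the assumptions we
made. In step (b), we picked Lagrange coefficients, namely $\frac{1}{C_1}$ and $\frac{C_2}{C_1}$, both of which are positive.
In particular, $C_1^*$ and $C_1$ as well as $C_2^*$ and $C_2$ are related by the Lagrange relaxation equivalence
(strong duality). In (c), we multiplied the objective with $C_1$ throughout, pulled the negative
sign in front, and removed the constant terms $C_1^*$ and $C_2 C_2^*$ and used the following observation:
$\max_a -g(a) = -\min_a g(a)$; and finally, removed the negative sign in front as this does not affect
equivalence. ■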

The equivalence relationship of Proposition 1 shows that there is a problem class in which each
instance can be viewed either as a RO problem or an estimation problem with an operational cost
bias. We can use ideas from RO to make the simultaneous process more general. Before doing so,
we will characterize $\mathcal{F}_{good}$ for several specific loss functions.



4.2 Creating Uncertainty Sets for RO Using Loss Functions from Machine Learning. 

Let us for simplicity specialize our loss function to the least squares loss. Let $X$ be an $n \times p$ matrix
with each training instance $x_i$ forming the $i$-th row. Also let $Y$ be the $n$-dimensional vector of all
the labels $y_i$. Then the loss term of (1) can be written as:

$$
\sum_{i=1}^n (y_i - \beta^T x_i)^2 = \|Y - X\beta\|_2^2.
$$

Let $\beta^*$ be a parameter corresponding to $f^*$ in (1). Then the definition of $\mathcal{F}_{good}$ in terms of the least
squares loss is:

$$
\mathcal{F}_{good} = \{f : f \in \mathcal{F}^R, \ \|Y - X\beta\|_2^2 \le \|Y - X\beta^*\|_2^2 + \epsilon\}
= \{f : f \in \mathcal{F}^R, \ \|Y - X\beta\|_2^2 \le C_1^*\}.
$$



Since each / S Tgood corresponds to at least one /3, the optimization of (|lj) can be performed 
with respect to $\beta$. In particular, the constraint $\|Y - X\beta\|_2^2 \le C_1^*$ is an ellipsoid constraint on $\beta$.
For the purposes of the robust counterpart in (12), we can thus say that the uncertainty is of the
ellipsoidal form. In fact, ellipsoidal constraints on uncertain parameters are widely used in robust
optimization, especially because the resulting optimization problems often remain tractable.
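
As a minimal sketch of this construction, the snippet below fits a two-feature least squares model on toy data and tests whether a candidate $\beta$ lies in the resulting ellipsoidal set, i.e., whether its squared loss is within $\epsilon$ of the best achievable. The data, the value of $\epsilon$, and the helper names are made up; the norm constraint defining $\mathcal{F}^R$ is ignored for simplicity.

```python
def fit_least_squares_2d(X, Y):
    """Solve the 2x2 normal equations (X^T X) beta = X^T Y by Cramer's rule."""
    s11 = sum(a * a for a, _ in X)
    s12 = sum(a * b for a, b in X)
    s22 = sum(b * b for _, b in X)
    t1 = sum(a * y for (a, _), y in zip(X, Y))
    t2 = sum(b * y for (_, b), y in zip(X, Y))
    det = s11 * s22 - s12 * s12
    return ((t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det)

def squared_loss(beta, X, Y):
    """Sum of squared residuals, i.e. ||Y - X beta||_2^2."""
    return sum((y - (beta[0] * a + beta[1] * b)) ** 2 for (a, b), y in zip(X, Y))

def in_good_set(beta, beta_star, X, Y, eps):
    """Membership test for the ellipsoid: loss(beta) <= loss(beta_star) + eps."""
    return squared_loss(beta, X, Y) <= squared_loss(beta_star, X, Y) + eps

# Toy training data (illustrative only).
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
Y = [1.1, 1.9, 3.2, 3.9]
beta_star = fit_least_squares_2d(X, Y)

eps = 0.5
near = in_good_set((1.0, 2.0), beta_star, X, Y, eps)  # close to the fit
far = in_good_set((5.0, 5.0), beta_star, X, Y, eps)   # far from the fit
print(beta_star, near, far)
```

Because the squared loss is a quadratic function of $\beta$, the set of parameters passing this test is an ellipsoid centered at the least squares solution.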
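
Loss function       Uncertainty set description

least squares       $\|Y - X\beta\|_2^2 \le \|Y - X\beta^*\|_2^2 + \epsilon$ (ellipsoid)
0-1 loss            $\sum_{i=1}^n \mathbf{1}_{[f(x_i) \ne y_i]} \le \sum_{i=1}^n \mathbf{1}_{[f^*(x_i) \ne y_i]} + \epsilon$
logistic loss       $\sum_{i=1}^n \log(1 + e^{-y_i f(x_i)}) \le \sum_{i=1}^n \log(1 + e^{-y_i f^*(x_i)}) + \epsilon$
exponential loss    $\sum_{i=1}^n e^{-y_i f(x_i)} \le \sum_{i=1}^n e^{-y_i f^*(x_i)} + \epsilon$
ramp loss           $\sum_{i=1}^n \min(1, \max(0, 1 - y_i f(x_i))) \le \sum_{i=1}^n \min(1, \max(0, 1 - y_i f^*(x_i))) + \epsilon$
hinge loss          $\sum_{i=1}^n \max(0, 1 - y_i f(x_i)) \le \sum_{i=1}^n \max(0, 1 - y_i f^*(x_i)) + \epsilon$

Table 1: Summary of different possible uncertainty set descriptions that are based
on ML loss functions.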



Box constraints are also a popular way of incorporating uncertainty into robust optimization.
For box constraints, the uncertainty over the $p$-dimensional parameter vector $\beta = [\beta_1, \ldots, \beta_p]^T$ is
written for $i = 1, \ldots, p$ as $LB_i \le \beta_i \le UB_i$, where $\{LB_i\}_i$ and $\{UB_i\}_i$ are real-valued lower and
upper bounds that together define the box intervals.

Our main point in this subsection is that one can potentially derive a very wide range of
uncertainty sets for robust optimization using different loss functions from machine learning. Box
constraints and ellipsoidal constraints are two simple types of constraints that could potentially be
the set $\mathcal{F}_{good}$, which arise from two different loss functions, as we have shown. The least squares
loss leads to ellipsoidal constraints on the uncertainty set, but it is unclear what the structure
would be for uncertainty sets arising from the 0-1 loss, ramp loss, hinge loss, logistic loss and
exponential loss among others. Further, it is possible to create a loss function for fitting data
to a probabilistic model using the method of maximum likelihood; uncertainty sets for maximum
likelihood could thus be established. Table 1 shows several different popular loss functions and
the uncertainty sets they might lead to. Many of these new uncertainty sets do not always give
tractable mathematical programs, which could explain why they are not commonly considered in
the optimization literature.

The sequential process for RO. If we design the uncertainty sets as described above, with
respect to a machine learning loss function, the sequential process described in Section 2 can be
used with robust optimization. This proceeds in three steps:

1. use a learning algorithm on the training data to get /*, 

2. establish an uncertainty set based on the loss function and /*, for example, ellipsoidal con- 
straints arising from the least squares loss (or one could use any of the new uncertainty sets 
discussed in the previous paragraph), 

3. use specialized optimization techniques to solve for the best policy, with respect to the un- 
certainty set. 

We note that the uncertainty sets created by the 0-1 loss and ramp loss, for instance, are
non-convex; consequently, assumption (A2) and Proposition 1 do not hold for robust optimization
problems that use these sets.
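
The three steps can be sketched in one dimension, where the least squares uncertainty set reduces to an interval around the fitted coefficient. This is a toy under stated assumptions: the model is $f(x) = \beta x$ fit through the origin, each policy is a hypothetical (fixed cost, per-unit rate) pair with cost linear in the prediction (so the worst case over the interval sits at an endpoint), and all numbers and names are made up.

```python
from math import sqrt

def op_cost(policy, prediction):
    """Hypothetical operational cost: fixed part plus a rate times the prediction."""
    fixed, rate = policy
    return fixed + rate * prediction

def sequential_robust_policy(xs, ys, x_new, policies, eps):
    # Step 1: least squares through the origin gives beta*.
    sxx = sum(x * x for x in xs)
    beta_star = sum(x * y for x, y in zip(xs, ys)) / sxx
    # Step 2: loss(beta) = loss(beta*) + (beta - beta*)^2 * sum(x^2), so the
    # eps-sublevel set is the interval |beta - beta*| <= sqrt(eps / sum(x^2)).
    radius = sqrt(eps / sxx)
    endpoints = (beta_star - radius, beta_star + radius)
    # Step 3: the cost is linear in beta, so its maximum over the interval
    # is attained at an endpoint; pick the policy with the best worst case.
    def worst_case(policy):
        return max(op_cost(policy, beta * x_new) for beta in endpoints)
    robust = min(policies, key=worst_case)
    nominal = min(policies, key=lambda policy: op_cost(policy, beta_star * x_new))
    return nominal, robust, endpoints

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
policies = [(0.0, 3.0), (2.6, 0.5)]
nominal, robust, endpoints = sequential_robust_policy(xs, ys, 1.0, policies, eps=0.14)
print(nominal, robust, endpoints)
```

Here the nominal (plain sequential) choice and the robust choice differ: the policy with the larger fixed cost is less sensitive to the uncertainty in $\beta$.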

4.3 The Overlap Between The Simultaneous Process and RO.

On the other end of the spectrum from robust optimization, one can think of "optimistic"
optimization where we are seeking the best value of the objective in the best possible situation (as
opposed to the worst possible situation in RO). For optimistic optimization, more uncertainty is
favorable, and we find the best policy for the best possible situation. This could be useful in many
real applications where one not only wants to know the worst-case conservative policy but also the
best case risk-taking policy. A typical formulation, following (12), can be written as:

$$
\min_{\pi \in \bigcup_{g \in \mathcal{F}_{good}} \Pi(g; \{\tilde{x}_i\}_i)} \ \min_{f \in \mathcal{F}_{good}} \text{OpCost}(\pi, f; \{\tilde{x}_i\}_i). \qquad \text{(Optimistic optimization)}
$$

In optimistic optimization, we view operational cost optimistically ($\min_{f \in \mathcal{F}_{good}}$ OpCost) whereas in
the robust optimization counterpart (12), we view operational cost conservatively ($\max_{f \in \mathcal{F}_{good}}$ OpCost).
The policy $\pi^*$ is feasible in more situations in RO ($\min_{\pi \in \bigcap_{g \in \mathcal{F}_{good}} \Pi}$) since it must be feasible with
respect to each $g \in \mathcal{F}_{good}$, whereas the OpCost is lower in optimistic optimization ($\min_{\pi \in \bigcup_{g \in \mathcal{F}_{good}} \Pi}$)
since it need only be feasible with respect to at least one of the $g$'s. Optimistic optimization has not
been heavily studied, possibly because a (min-min) formulation is relatively easier to solve than its
(min-max) robust counterpart, and so is less computationally interesting. Also, one generally plans
for the worst case more often than for the best case, particularly when no estimation is involved.
In the case where estimation is involved, both optimistic and robust optimization could potentially
be useful to a practitioner.
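
The contrast between the two formulations can be seen on a finite toy instance with a fixed policy set. In this sketch the uncertainty set is replaced by three candidate coefficient values, each policy is a hypothetical (fixed cost, per-unit rate) pair, and the helper name `optimistic_and_robust` is an assumption; the point is only that the min-min value never exceeds the min-max value and that the two can select different policies.

```python
def optimistic_and_robust(policies, models, op_cost):
    """Return ((optimistic value, policy), (robust value, policy)) over finite sets."""
    best_case = {p: min(op_cost(p, f) for f in models) for p in policies}
    worst_case = {p: max(op_cost(p, f) for f in models) for p in policies}
    optimistic_policy = min(policies, key=lambda p: best_case[p])
    robust_policy = min(policies, key=lambda p: worst_case[p])
    return ((best_case[optimistic_policy], optimistic_policy),
            (worst_case[robust_policy], robust_policy))

def linear_cost(policy, beta):
    fixed, rate = policy
    return fixed + rate * beta  # hypothetical cost, linear in the model parameter

models = [0.9, 1.0, 1.1]             # finite stand-in for the uncertainty set
policies = [(0.0, 3.0), (2.6, 0.5)]  # made-up (fixed, rate) pairs

(opt_value, opt_policy), (rob_value, rob_policy) = optimistic_and_robust(
    policies, models, linear_cost)
print(opt_value, opt_policy, rob_value, rob_policy)
```

A practitioner could report both numbers as the ends of a plausible cost range supported by the data.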

Both optimistic optimization and robust optimization, considered with respect to uncertainty sets $\mathcal{F}_{good}$, have non-trivial overlap with the simultaneous process. In particular, we showed in Proposition 1 that pessimistic bias on operational cost is equivalent to robust optimization under specific conditions on OpCost and $\Pi$. Using an analogous proof, one can show that optimistic bias on operational cost is equivalent to optimistic optimization under the same set of conditions. Both robust and optimistic optimization and the simultaneous process encompass large classes of problems, some of which overlap. Figure 7 represents the overlap between the three classes of problems. There is a class of problems that fall into the simultaneous process, but are not equivalent to robust or optimistic optimization problems. These are problems where we use operational cost to assist with estimation, as in the call center example and ML&TRP discussed in Section 3. Typically problems in this class have $\Pi = \Pi(f; \{\tilde{x}_i\}_i)$. This class includes problems where the bias can be either optimistic or pessimistic, and for which $\mathcal{F}_{good}$ has a complicated structure, beyond ellipsoidal or box constraints. There are also problems contained in either robust optimization or optimistic optimization alone that do not belong to the simultaneous process. Typically, again, this is when
$\Pi$ depends on $f$. Note that the housing problem presented in Section 3 lies within the intersection of optimistic optimization and the simultaneous process; this can be deduced from (7).

In Section 5 we will provide statistical guarantees for the simultaneous process. These are very different from the style of probabilistic guarantees in the robust optimization literature (Ben-Tal et al., 2009a). There are some "sample complexity" bounds (Ben-Tal et al., 2009b) in the RO literature of the following form: how many observations of uncertain data are required (and applied as simultaneous constraints) to maintain robustness of the solution with high probability? There is an unfortunate overlap in terminology; these are totally different problems from the sample complexity bounds in statistical learning theory. From the learning theory perspective, we ask: how many training instances does it take to come up with a model $\beta$ that we reasonably know to be good? We will answer that question for a very general class of estimation problems.

Figure 7: Set based description of the proposed framework (top circle) and its relation to robust (right circle) and optimistic (left circle) optimizations. The regions of intersection are where the conditions on the objective OpCost and the feasible set $\Pi$ are satisfied.



5. Generalization Bound with New Linear Constraints 

In this section, we give statistical learning theoretic results for the simultaneous process that involve counting integer points in convex bodies. Generalization bounds are probabilistic guarantees that often depend on some measure of the complexity of the hypothesis space. Limiting the complexity of the hypothesis space equates to a better bound. In this section, we consider the complexity of hypothesis spaces that result from an operational cost bias.

Generalization bounds have been well established for norm-based constraints on the hypothesis space, but the emphasis has been more on qualitative dependence (e.g., using big-O notation) and the constants are not emphasized. On the other hand, for a practitioner, every prior belief should reduce the number of instances they need to collect (these may be expensive to obtain), and thus constants (even their approximate values) become important (Bousquet, 2003). We thus provide bounds on the covering number for new types of hypothesis spaces, emphasizing the role of constants.

To establish the bound, it is sufficient to provide an upper bound on the covering number, since there are a large number of probabilistic bounds in the learning theory literature (e.g., Bartlett and Mendelson, 2002) that can be applied to our bound in order to obtain a generalization bound, as we will do in Theorem 10.

In Section 3 we showed that a bias on the operational cost can sometimes be transformed into
linear constraints on model parameter /3 (see equations ([s]) and Q). There is a broad class of 
other problems for which this is true, for example, for several applications presented in Section 3.
Because we are able to obtain linear constraints for such a broad class of problems, we will analyze the case of linear constraints here. The hypothesis space we consider is thus the intersection of an $\ell_q$ ball and a halfspace. This is illustrated in Figure 8.

The plan for the rest of the section is as follows. We will introduce the quantities on which our main result in this section depends. Then, we will state the main result (Theorem 6). Following that, we will build up to a generalization bound (Theorem 10) that incorporates Theorem 6. After that will be the proof of Theorem 6.

Figure 8: Left: hypothesis space for intersection of good models (circular, to represent $\ell_q$ ball) with low cost models (models below cost threshold, one side of wiggly curve). Right: relaxation to intersection of a half space with an $\ell_q$ ball.



Definition 5 (Covering Number, Kolmogorov and Tikhomirov, 1959) Let $A$ be an arbitrary set and $(\Gamma, \rho)$ a (pseudo-)metric space. Let $|\cdot|$ denote set size.

• For any $\epsilon > 0$, an $\epsilon$-cover for $A$ is a finite set $U \subseteq \Gamma$ (not necessarily $\subseteq A$) s.t. $\forall a \in A, \exists u \in U$ with $d_\rho(a, u) \le \epsilon$.

• The covering number of $A$ is $N(\epsilon, A, \rho) := \inf_U |U|$ where $U$ is an $\epsilon$-cover for $A$.
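As a small illustration of Definition 5, the sketch below builds an $\epsilon$-cover greedily for a finite set of points under the Euclidean metric. The function name and the grid example are hypothetical. Because any valid $\epsilon$-cover has at least as many elements as the smallest one, the size of the greedy cover is an upper bound on the covering number, not necessarily its exact value.

```python
import math


def greedy_cover(points, eps):
    """Greedily pick centers from `points` until every point is within eps of one."""
    centers = []
    for a in points:
        # Add `a` as a new center only if no existing center already covers it.
        if all(math.dist(a, u) > eps for u in centers):
            centers.append(a)
    return centers


# Hypothetical example: a 5 x 5 grid of points in the unit square.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
cover = greedy_cover(grid, eps=0.5)
```

Here `len(cover)` upper-bounds $N(0.5, A, \|\cdot\|_2)$ for the grid $A$.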

We are given the set of $n$ instances $S := \{x_i\}_{i=1}^n$ with each $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$ where $\mathcal{X} = \{x : \|x\|_r \le X_b\}$, $2 \le r \le \infty$ and $X_b$ is a known constant. Let $\mu_{\mathcal{X}}$ be a probability measure on $\mathcal{X}$. Let $x_i$ be arranged as rows of a matrix $X$. We can represent the columns of $X = [x_1 \ldots x_n]^T$ with $h_j \in \mathbb{R}^n$, $j = 1, \ldots, p$, so $X$ can also be written as $[h_1 \ldots h_p]$. Define function class $\mathcal{F}$ as the set of linear functionals whose coefficients lie in an $\ell_q$ ball and with a set of linear constraints:

$$\mathcal{F} := \{f : f(x) = \beta^T x, \; \beta \in \mathcal{B}\} \quad \text{where}$$

$$\mathcal{B} := \Big\{\beta \in \mathbb{R}^p : \|\beta\|_q \le B_b, \; \sum_{j=1}^p c_{jl}\beta_j + \delta_l \le 1, \; \delta_l > 0, \; l = 1, \ldots, L\Big\},$$

where $1/r + 1/q = 1$ and $\{c_{jl}\}_{j,l}$, $\{\delta_l\}_l$ and $B_b$ are known constants. The linear constraints given by the $c_{jl}$'s force the hypothesis space to be smaller, which will help with generalization; this will be shown formally by our main result in this section. Let $\mathcal{F}_{|S}$ be defined as the restriction of $\mathcal{F}$ with respect to $S$.

Let $\{\tilde{c}_{jl}\}_{j,l}$ be proportional to $\{c_{jl}\}_{j,l}$:

$$\tilde{c}_{jl} := \frac{c_{jl}\, n^{1/r} X_b B_b}{\|h_j\|_r} \quad \forall j = 1, \ldots, p \text{ and } l = 1, \ldots, L.$$

Define $\tilde{X}$ to be equal to $X$ times a diagonal matrix whose $j$-th diagonal element is $\frac{n^{1/r} X_b B_b}{\|h_j\|_r}$. Let $K$ be a positive number. Further, let the set $P^K$ be defined as

$$P^K := \Big\{(k_1, \ldots, k_p) \in \mathbb{Z}^p : \sum_{j=1}^p |k_j| \le K, \; \sum_{j=1}^p \tilde{c}_{jl} k_j \le K \;\; \forall l = 1, \ldots, L\Big\}. \qquad (16)$$

Let $|P^K|$ be the size of set $P^K$. As the linear constraints given by the $\tilde{c}_{jl}$'s force the hypothesis space to be smaller, they force $|P^K|$ to be smaller. Define $\lambda_{\min}(\tilde{X}^T\tilde{X})$ to be the smallest eigenvalue of the matrix $\tilde{X}^T\tilde{X}$, which will thus be non-negative. Using these definitions, we state our main result of this section.
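To make the quantity $|P^K|$ concrete, the following sketch counts the integer points by brute force for small $p$ and $K$. It reads (16) as the set of integer $p$-tuples with $\sum_j |k_j| \le K$ and $\sum_j \tilde{c}_{jl} k_j \le K$ for every $l$; the function name and the example coefficients are hypothetical, and enumeration like this is only feasible for tiny instances.

```python
from itertools import product


def count_PK(K, p, c_tilde=()):
    """Count integer p-tuples k with sum|k_j| <= K and sum_j c_tilde[l][j]*k_j <= K for all l.

    c_tilde is a sequence of L rows, each of length p; the default (no rows)
    counts every integer point of the l1 ball of radius K.
    """
    count = 0
    for k in product(range(-K, K + 1), repeat=p):
        if sum(abs(kj) for kj in k) > K:
            continue  # outside the l1 ball
        if all(sum(c * kj for c, kj in zip(row, k)) <= K for row in c_tilde):
            count += 1
    return count


unconstrained = count_PK(3, 2)              # l1 ball only
constrained = count_PK(3, 2, [(2, 2)])      # one hypothetical linear constraint
```

Adding the single constraint removes the lattice points with $k_1 + k_2 \ge 2$, so the count drops, which is the effect the text describes.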

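Theorem 6 (Main result, covering number bound) If

$$K \ge \max\left\{ \frac{X_b^2 B_b^2}{\epsilon^2}, \; \frac{n X_b^2 B_b^2}{\lambda_{\min}(\tilde{X}^T\tilde{X}) \left[\min_{l=1,\ldots,L} \frac{\delta_l}{\sum_{j=1}^p |\tilde{c}_{jl}|}\right]^2} \right\},$$

then

$$N(\sqrt{n}\,\epsilon, \mathcal{F}_{|S}, \|\cdot\|_2) \le |P^K|. \qquad (17)$$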



The theorem gives a bound on the $\ell_2$ covering number for the specially constrained class $\mathcal{F}_{|S}$. The bound improves as the constraints given by $\tilde{c}_{jl}$ on the operational cost become tighter. In other words, as the $\tilde{c}_{jl}$ impose more restrictions on the hypothesis space, $|P^K|$ decreases, and the covering number bound becomes smaller. This bound can be plugged directly into an established generalization bound that incorporates covering numbers, and this will be done in what follows to obtain Theorem 10.

For this problem we generally assume that $n > cp$ for $c > 1$. In such a case, $\lambda_{\min}(\tilde{X}^T\tilde{X})$ can be shown to be bounded away from zero for a wide variety of distributions $\mu_{\mathcal{X}}$ (e.g., sub-gaussian zero-mean). When $\lambda_{\min}(\tilde{X}^T\tilde{X}) = 0$, the covering number bound becomes vacuous.
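The threshold on $K$ in Theorem 6 can be evaluated directly once its ingredients are known. The sketch below assumes the condition reads $K \ge \max\{X_b^2B_b^2/\epsilon^2,\; nX_b^2B_b^2 / (\lambda_{\min}(\tilde{X}^T\tilde{X})\,[\min_l \delta_l/\sum_j|\tilde{c}_{jl}|]^2)\}$, which is the form obtained by combining Lemma 13 and Lemma 14 in the proof; the function name and the numbers in the example are hypothetical, and $\lambda_{\min}$ is taken as a given input rather than computed.

```python
import math


def min_K(eps, n, X_b, B_b, lam_min, c_tilde, delta):
    """Smallest integer K meeting both requirements (assumed form of the condition).

    c_tilde: L rows of scaled constraint coefficients; delta: L slack constants.
    """
    if lam_min <= 0:
        raise ValueError("the bound is vacuous when lambda_min is 0")
    # Requirement 1: squared approximation error n*X_b^2*B_b^2/K is at most n*eps^2.
    first = (X_b * B_b / eps) ** 2
    # Requirement 2: discretization fine enough to preserve the linear constraints.
    ratio = min(d / sum(abs(c) for c in row) for row, d in zip(c_tilde, delta))
    second = n * (X_b * B_b) ** 2 / (lam_min * ratio ** 2)
    return math.ceil(max(first, second))


K_needed = min_K(eps=0.5, n=4, X_b=1.0, B_b=1.0, lam_min=2.0,
                 c_tilde=[(1.0, 1.0)], delta=[0.5])
```

In this example the constraint-preservation requirement dominates the accuracy requirement, so tighter slack $\delta_l$ or a smaller eigenvalue pushes $K$ up.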

Let us introduce some notation in order to state the generalization bound results. Given any function $f \in \mathcal{F}$, we would like to minimize the expected future loss (also known as the expected risk), defined as:

$$R^{true}(l \circ f) := \mathbb{E}_{(x,y) \sim \mu_{\mathcal{X} \times \mathcal{Y}}}[l(f(x), y)] = \int l(f(x), y)\, d\mu_{\mathcal{X} \times \mathcal{Y}}(x, y),$$

where $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is the (fixed) loss function we had previously defined in Section 2. The loss on the training sample (also known as empirical risk) is:

$$R^{emp}(l \circ f, \{(x_i, y_i)\}_1^n) := \frac{1}{n} \sum_{i=1}^n l(f(x_i), y_i).$$

We would like to know that $R^{true}(l \circ f)$ is not too much more than $R^{emp}(l \circ f, \{(x_i, y_i)\}_1^n)$, no matter which $f$ we choose from $\mathcal{F}$. A typical form of generalization bound that holds with high probability for every function in $\mathcal{F}$ is

$$R^{true}(l \circ f) \le R^{emp}(l \circ f, \{(x_i, y_i)\}_1^n) + \mathrm{Bound}(\mathrm{complexity}(\mathcal{F}), n), \qquad (18)$$

where the complexity term takes into account the constraints on $\mathcal{F}$, both the linear constraints and the $\ell_q$-ball constraint. Theorem 6 gives an upper bound on the term $\mathrm{Bound}(\mathrm{complexity}(\mathcal{F}), n)$ in (18) above. In order to show this explicitly, we will give the definition of Rademacher complexity, restate how it appears in the relation between expected future loss and loss on training instances, and state an upper bound for it in terms of the covering number.

Definition 7 (Rademacher Complexity) The empirical Rademacher complexity of $\mathcal{F}_{|S}$ is

$$\bar{\mathcal{R}}(\mathcal{F}_{|S}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^n \sigma_i f(x_i)\right], \qquad (19)$$

where $\{\sigma_i\}$ are Rademacher random variables ($\sigma_i = 1$ with prob. 1/2 and $-1$ with prob. 1/2). The Rademacher complexity is its expectation: $\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_{S \sim (\mu_{\mathcal{X}})^n}[\bar{\mathcal{R}}(\mathcal{F}_{|S})]$.
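For a finite function class and a small sample, the empirical Rademacher complexity can be computed exactly by averaging over all $2^n$ sign vectors. The sketch below assumes the form $\mathbb{E}_\sigma[\sup_f \frac{2}{n}\sum_i \sigma_i f(x_i)]$, with the factor 2 inside the definition and no absolute value; some authors use slightly different conventions. The function name and the toy classes are hypothetical.

```python
from itertools import product


def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class on a fixed sample.

    values[f][i] holds f(x_i); the supremum over the class is a max over rows.
    """
    n = len(values[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(2.0 / n * sum(s * v for s, v in zip(sigma, fv)) for fv in values)
    return total / 2 ** n


single = empirical_rademacher([(1, 1)])               # one function: no capacity to fit noise
symmetric = empirical_rademacher([(1, 1), (-1, -1)])  # a function and its negation
```

A single fixed function cannot correlate with random signs on average, while the two-element class can always match the sign of $\sum_i \sigma_i$, which is why its complexity is larger.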

The empirical Rademacher complexity $\bar{\mathcal{R}}(\mathcal{F}_{|S})$ can be computed given $S$ and $\mathcal{F}$, and by concentration, will be close to the Rademacher complexity. The following result relates the true risk to the empirical risk and empirical Rademacher complexity for any function class $\mathcal{H}$ (see Bartlett and Mendelson, 2002, and references therein). Let the quantities $\mathcal{H}_{|S}$, $R^{true}(l \circ h)$ and $R^{emp}(l \circ h, \{x_i, y_i\}_1^n)$ be analogous to those we had defined for our specific class $\mathcal{F}$.

Theorem 8 (Rademacher Generalization Bound) For all $\delta > 0$, with probability at least $1 - \delta$, $\forall h \in \mathcal{H}$,



R'^%1 o h) < R^'^-^ii o h, {x„ 2/,}?) + c ■ n{n\s) + ^1 ' ^ 



n 



(20) 



Note that (20) is an explicit form of (18). We will now relate $\bar{\mathcal{R}}(\mathcal{F}_{|S})$ to covering numbers, thus justifying the importance of statement (17) in Theorem 6. In particular, the following infinite chaining argument, also known as Dudley's integral (see Talagrand, 2005), relates $\bar{\mathcal{R}}(\mathcal{F}_{|S})$ to the covering number of the set $\mathcal{F}_{|S}$.

Theorem 9 (Relating Rademacher Complexity to Covering Numbers) We are given that $\forall x \in \mathcal{X}$, we have $f(x) \in [-X_b B_b, X_b B_b]$. Then,



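$$\bar{\mathcal{R}}(\mathcal{F}_{|S}) \le 12 \int_0^{X_b B_b} \sqrt{\frac{2 \log N(\alpha, \mathcal{F}, L_2(\mu_{\mathcal{X}}^n))}{n}}\, d\alpha = 12 \int_0^{X_b B_b} \sqrt{\frac{2 \log N(\sqrt{n}\,\alpha, \mathcal{F}_{|S}, \|\cdot\|_2)}{n}}\, d\alpha. \qquad (21)$$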



Our main result in Theorem 6 can be used in conjunction with Theorems 8 and 9 to directly see how the true error relates to the empirical error and the prior assumptions (the $\ell_q$-norm bound on $\beta$ and the linear constraints on $\beta$ from the operational cost bias). Explicitly, that bound is as follows.

Theorem 10 (Generalization Bound for ML with Operational Costs) For all $\delta > 0$, with probability at least $1 - \delta$, $\forall f \in \mathcal{F}$,



R"'{1 o /) < R'^'il of,{xt,y,}f) + UCXtBi, 



21og inf^^jPe' 



n 



5. The factor 2 in the defining equation (19) is not very important. Some authors omit this factor and include it explicitly as a pre-factor in, for example, Theorem 8.

where the infimum is taken over



Kf > max < 



e Amin(^^^)mini=i,..,L 



> . 



This bound implies that prior knowledge about the operational cost can be important for generalization. As our prior knowledge on the cost becomes stronger, the hypothesis space becomes more restricted, as seen through the constraints given by the $\tilde{c}_{jl}$. When this happens, the $|P^K|$ terms become smaller, and the whole bound becomes smaller.

Before we move on to building the necessary tools to prove Theorem 6, we compare our result with the bound in our work on the ML&TRP (Tulabandhula et al., 2011b,a). In that work, we considered a linear function class with a constraint on the $\ell_2$-norm and one additional linear inequality constraint on $\beta$. We then used a sample-independent volumetric cap argument to get a covering number bound. Theorem 6 is in some ways an improvement of the other result: (1) we can now have multiple linear constraints on $\beta$; (2) our new result involves a sample-specific bounding technique for covering numbers, which is generally tighter; (3) our result applies to $\ell_q$ balls for any q > 1 whereas the previous analysis holds only for $q = 2$. The volumetric argument in (Tulabandhula et al., 2011b,a) provided a scaling of the covering number. Specifically, the operational cost term for the ML&TRP allowed us to reduce the covering number term in the bound from $\sqrt{\log N(\cdot, \cdot, \cdot)}$ to $\sqrt{\log(\alpha N(\cdot, \cdot, \|\cdot\|_2))}$, or equivalently $\sqrt{\log N(\cdot, \cdot, \|\cdot\|_2) + \log \alpha}$. If the scaling constant $\alpha$ obeys $\alpha \ll 1$, then there is a noticeable effect on the generalization bound, compared to almost no effect when $\alpha \approx 1$. In the present work, the bound does not scale the covering number like this; instead, it is a very different approach giving a more direct bound.



5.1 Proof of Theorem 6

Our new result is in the spirit of Zhang (2002), whose result is restated below and makes use of Maurey's Lemma (Barron, 1993). The main ideas of Maurey's Lemma are used in many machine learning papers in various contexts (e.g., Koltchinskii and Panchenko, 2005; Schapire et al., 1998; Rudin and Schapire, 2009). Our proof of Theorem 6 adapts Maurey's Lemma to handle polyhedrons, and allows us to apply counting techniques to bound the covering number.



Recall that $X = [x_1 \ldots x_n]^T$ was also defined column-wise as $[h_1 \ldots h_p]$. We introduce two scaled sets $\{\tilde{h}_j\}_j$ and $\{\tilde{\beta}_j\}_j$ corresponding to $\{h_j\}_j$ and $\{\beta_j\}_j$ as follows:

$$\tilde{h}_j := \frac{n^{1/r} X_b B_b}{\|h_j\|_r}\, h_j \quad \text{for } j = 1, \ldots, p; \text{ and}$$

$$\tilde{\beta}_j := \frac{\|h_j\|_r}{n^{1/r} X_b B_b}\, \beta_j \quad \text{for } j = 1, \ldots, p.$$

These scaled sets will be convenient in places where we do not want to carry the scaling terms separately.

Any vector $y$ that is equal to $X\beta$ can thus be written in three different ways:

$$y = \sum_{j=1}^p \beta_j h_j, \text{ or}$$

$$y = \sum_{j=1}^p \tilde{\beta}_j \tilde{h}_j, \text{ or}$$

$$y = \sum_{j=1}^p |\tilde{\beta}_j|\, \mathrm{sign}(\tilde{\beta}_j)\, \tilde{h}_j.$$

Our first lemma is a revised statement of Lemma 1 from Zhang (2002), whose proof is similar to that of Lemma 1 of Barron (1993), which is based on the law of large numbers. An alternative proof of Lemma 1 from Zhang (2002) is given by Jones (1992), based on iterative approximation. The lemma states that every point $y$ in the convex hull of $\{\tilde{h}_j\}_j$ is close to one of the points $y_K$ in a particular finite set.



Lemma 11 Let $\max_{j=1,\ldots,p} \|\tilde{h}_j\|$ be less than or equal to some constant $b$. If $y$ belongs to the convex hull of set $\{\tilde{h}_j\}_j$, then for every positive integer $K \ge 1$, there exists $y_K$ in the convex hull of $K$ points of set $\{\tilde{h}_j\}_j$ such that $\|y - y_K\|^2 \le \frac{b^2}{K}$.



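Proof Let $y$ be written in the form:

$$y = \sum_{j=1}^p \gamma_j \tilde{h}_j,$$

where for each $j = 1, \ldots, p$, $\gamma_j \ge 0$ and $\sum_{j=1}^p \gamma_j \le 1$. Let $\gamma_{p+1} := 1 - \sum_{j=1}^p \gamma_j$.

Consider a discrete distribution $\mathcal{D}$ formed by the coefficient vector $(\gamma_1, \ldots, \gamma_p, \gamma_{p+1})$. Associate a random variable $\mathbf{h}$ with support set $\{\tilde{h}_1, \ldots, \tilde{h}_p, 0\}$. That is, $\Pr(\mathbf{h} = \tilde{h}_j) = \gamma_j$, $j = 1, \ldots, p$ and $\Pr(\mathbf{h} = 0) = \gamma_{p+1}$.

Draw $K$ observations $\{\mathbf{h}^1, \ldots, \mathbf{h}^K\}$ independently from $\mathcal{D}$ and form the sample average $y_K := \frac{1}{K} \sum_{s=1}^K \mathbf{h}^s$. Here, we are using the superscript index to denote the observation number. Writing $\tilde{h}_{p+1} := 0$, the mean of this random variable $y_K$ is:

$$\mathbb{E}_{\mathcal{D}}[y_K] = \frac{1}{K} \sum_{s=1}^K \mathbb{E}_{\mathcal{D}}[\mathbf{h}^s] = \sum_{j=1}^{p+1} \Pr(\mathbf{h} = \tilde{h}_j)\, \tilde{h}_j = \sum_{j=1}^p \gamma_j \tilde{h}_j = y,$$

hence $\mathbb{E}_{\mathcal{D}}[y_K] = y$.

The expected squared distance between $y_K$ and $y$ is:

$$\mathbb{E}_{\mathcal{D}}[\|y_K - y\|^2] = \mathbb{E}_{\mathcal{D}}[\|y_K - \mathbb{E}_{\mathcal{D}}[y_K]\|^2] = \sum_{i=1}^n \mathbb{E}_{\mathcal{D}}\big[((y_K)_i - \mathbb{E}_{\mathcal{D}}[(y_K)_i])^2\big]$$

$$\overset{(\dagger)}{=} \sum_{i=1}^n \mathrm{Var}((y_K)_i) \overset{(*)}{=} \sum_{i=1}^n \frac{1}{K} \mathrm{Var}((\mathbf{h})_i) \overset{(\ddagger)}{=} \frac{1}{K} \sum_{i=1}^n \big(\mathbb{E}_{\mathcal{D}}[(\mathbf{h})_i^2] - \mathbb{E}_{\mathcal{D}}[(\mathbf{h})_i]^2\big)$$

$$\overset{(\circ)}{\le} \frac{1}{K} \mathbb{E}_{\mathcal{D}}[\|\mathbf{h}\|^2] \le \frac{b^2}{K}, \qquad (23)$$

where we have used $i$ to be the index for the $i$-th coordinate of the $n$ dimensional vectors. $(\dagger)$ follows from the definition of variance coordinate-wise. $(*)$ follows because each component of $y_K$ is a sample average. $(\ddagger)$ also follows from the definition of variance. At step $(\circ)$, we rewrite the previous summations involving squares into ones that use the Hilbert norm. Our assumption on $\max_{j=1,\ldots,p} \|\tilde{h}_j\|$ tells us that $\mathbb{E}_{\mathcal{D}}[\|\mathbf{h}\|^2] \le b^2$, leading to (23). Since the expected squared Hilbert norm of the deviation of the sample mean from $y$ is bounded in this way, there exists a $y_K$ that satisfies the inequality, so that

$$\|y_K - y\|^2 \le \frac{b^2}{K}. \qquad \blacksquare$$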
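The sampling argument in this proof can be checked numerically on a tiny instance by enumerating every possible sequence of $K$ draws. The sketch below is illustrative only; the function name and the two-point example are hypothetical. It computes the exact expected squared distance and compares it with the closed form $(\mathbb{E}\|\mathbf{h}\|^2 - \|y\|^2)/K$ and with the bound $b^2/K$.

```python
from itertools import product


def expected_sq_dist(H, gamma, K):
    """Exact E||y_K - y||^2 when y_K averages K i.i.d. draws from {h_1, ..., h_p, 0}."""
    dim = len(H[0])
    support = list(H) + [tuple(0.0 for _ in range(dim))]
    probs = list(gamma) + [1.0 - sum(gamma)]  # leftover mass goes to the origin
    y = [sum(g * h[i] for g, h in zip(gamma, H)) for i in range(dim)]
    total = 0.0
    for draw in product(range(len(support)), repeat=K):
        pr = 1.0
        for idx in draw:
            pr *= probs[idx]
        y_K = [sum(support[idx][i] for idx in draw) / K for i in range(dim)]
        total += pr * sum((a - c) ** 2 for a, c in zip(y_K, y))
    return total


H = [(1.0, 0.0), (0.0, 1.0)]   # two points with norm at most b = 1
gamma = (0.5, 0.25)            # convex weights; the remaining 0.25 sits on the origin
K = 3
value = expected_sq_dist(H, gamma, K)
```

Since the average over all draws is at most $b^2/K$, at least one draw must achieve that distance, which is the existence claim of the lemma.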



The following corollary states explicitly that an approximation to $y$ exists that is a linear combination with coefficients chosen from a particular discrete set.

Corollary 12 For any $y$ and $K$ as considered above, we can find non-negative integers $m_1, \ldots, m_p$ such that $\sum_{j=1}^p m_j \le K$ and $\big\|y - \sum_{j=1}^p \frac{m_j}{K} \tilde{h}_j\big\|^2 \le \frac{b^2}{K}$.

This follows immediately from the proof of Lemma 11, choosing the $m_j$ to be the integer coefficients of the $\tilde{h}_j$'s such that $y_K = \sum_{j=1}^p \frac{m_j}{K} \tilde{h}_j$ (that is, $m_j$ counts how many of the $K$ draws equal $\tilde{h}_j$).

The above corollary means that counting the number of $p$-tuples of non-negative integers $m_1, \ldots, m_p$ gives us a covering of the set that $y$ belongs to. In the case of Lemma 11, this set is the convex hull of $\{\tilde{h}_j\}_j$.

Before we can go further, we need to generalize the argument from the positive orthant of the $\ell_1$ ball to handle any coefficients that are in the whole unit-length $\ell_1$-ball. This is what the following lemma accomplishes.

Lemma 13 Let $\max_{j=1,\ldots,p} \|\tilde{h}_j\|$ be less than or equal to some constant $b$. For any $y = \sum_{j=1}^p \tilde{\beta}_j \tilde{h}_j$ such that $\|\tilde{\beta}\|_1 \le 1$, we can find a $y_K$ such that

$$\|y - y_K\|_2^2 \le \frac{b^2}{K},$$

where $y_K = \sum_{j=1}^p \frac{k_j}{K} \tilde{h}_j$ is a combination of $\{\tilde{h}_j\}$ with integers $k_1, \ldots, k_p$ such that $\sum_{j=1}^p |k_j| \le K$.

Proof Lemma 11 cannot be applied directly since the $\{\tilde{\beta}_j\}_j$ can be negative. We rewrite $y$, or equivalently $\sum_{j=1}^p \tilde{\beta}_j \tilde{h}_j$, as

$$y = \sum_{j=1}^p |\tilde{\beta}_j|\, \mathrm{sign}(\tilde{\beta}_j)\, \tilde{h}_j.$$

Thus $y$ lies in the convex hull of $\{\mathrm{sign}(\tilde{\beta}_j)\tilde{h}_j\}_j$. Note that this step makes the convex hull depend on the $y$ or $\{\tilde{\beta}_j\}_j$ we start with. Nonetheless, we know by substituting $\{\mathrm{sign}(\tilde{\beta}_j)\tilde{h}_j\}_j$ for $\{\tilde{h}_j\}_j$ in the statement of Lemma 11 and Corollary 12 that

1. we can find $y_K$, or equivalently

2. we can find non-negative integers $m_1, \ldots, m_p$ with $\sum_{j=1}^p m_j \le K$,

such that $\|y - y_K\|_2^2 \le \frac{b^2}{K}$, where $y_K = \sum_{j=1}^p \frac{m_j}{K}\, \mathrm{sign}(\tilde{\beta}_j)\, \tilde{h}_j$, holds. This implies there exist integers $k_1, \ldots, k_p$ such that $y_K = \sum_{j=1}^p \frac{k_j}{K} \tilde{h}_j$ where $\sum_{j=1}^p |k_j| \le K$. We simply let $k_j = m_j\, \mathrm{sign}(\tilde{\beta}_j)$. Thus, we absorbed the signs of the $\tilde{\beta}_j$'s, and the coefficients no longer need to be nonnegative.

In other words, we have shown that if a particular $y_K$ is in the convex hull of points $\{\mathrm{sign}(\tilde{\beta}_j)\tilde{h}_j\}_j$, then the same $y_K$ is a linear combination of $\{\tilde{h}_j\}_j$ where the coefficients of the combination $k_1/K, \ldots, k_p/K$ obey $\sum_{j=1}^p |k_j| \le K$. This concludes the proof. $\blacksquare$



We now want to answer the question of whether the $k_1/K, \ldots, k_p/K$ can obey (related) linear constraints if the original $\{\tilde{\beta}_j\}_j$'s did so. These constraints on the $\{\tilde{\beta}_j\}_j$'s are the ones coming from constraints on the operational cost. In other words, we want to know that our (discretized) approximation of $y$ also obeys a constraint coming from the operational cost.

Let $\{\tilde{\beta}_j\}_j$ satisfy the linear constraints within the definition of $\mathcal{B}$, in addition to satisfying $\|\tilde{\beta}\|_1 \le 1$:

$$\sum_{j=1}^p \tilde{c}_{jl} \tilde{\beta}_j + \delta_l \le 1, \text{ for fixed } \delta_l > 0, \; l = 1, \ldots, L.$$

We now want, for large enough $K$, the $p$-tuple $k_1/K, \ldots, k_p/K$ to also meet certain related linear constraints.

We will make use of the matrix $\tilde{X}$, defined before Theorem 6. It has the elements of the scaled set $\{\tilde{h}_j\}_j$ as its columns: $\tilde{X} := [\tilde{h}_1 \ldots \tilde{h}_p]$.

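Lemma 14 Take any $y = \sum_{j=1}^p \tilde{\beta}_j \tilde{h}_j$ and any $y_K = \sum_{j=1}^p \frac{k_j}{K} \tilde{h}_j$, with:

$$\sum_{j=1}^p \tilde{c}_{jl} \tilde{\beta}_j + \delta_l \le 1, \text{ for fixed } \delta_l > 0, \; l = 1, \ldots, L,$$

and $\|y - y_K\|_2^2 \le \frac{n X_b^2 B_b^2}{K}$. Whenever

$$K \ge \frac{n X_b^2 B_b^2}{\lambda_{\min}(\tilde{X}^T\tilde{X}) \left[\min_{l=1,\ldots,L} \frac{\delta_l}{\sum_{j=1}^p |\tilde{c}_{jl}|}\right]^2},$$

then the following linear constraints on $k_1/K, \ldots, k_p/K$ hold:

$$\sum_{j=1}^p \tilde{c}_{jl} \frac{k_j}{K} \le 1, \quad l = 1, \ldots, L.$$

This lemma states that as long as the discretization is fine enough, our approximation obeys similar operational cost constraints to $y$.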

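Proof

Let $\kappa := [k_1/K \ldots k_p/K]^T$. Using the definition of $\tilde{X}$,

$$\frac{n X_b^2 B_b^2}{K} \ge \|y - y_K\|_2^2 = \|\tilde{X}\tilde{\beta} - \tilde{X}\kappa\|_2^2 = (\tilde{\beta} - \kappa)^T \tilde{X}^T \tilde{X} (\tilde{\beta} - \kappa) \overset{(*)}{\ge} \lambda_{\min}(\tilde{X}^T\tilde{X})\, \|\tilde{\beta} - \kappa\|_2^2. \qquad (24)$$

In $(*)$, we used the fact that for a positive (semi-)definite matrix $M$ and for every non-zero vector $z$, $z^T M z \ge \lambda_{\min}(M)\, z^T z$. (If $\tilde{\beta} = \kappa$, we are done since $\kappa$ will obey the constraints $\tilde{\beta}$ obeys.) Also, for any $z$, in each coordinate $j$, $|z_j| \le \|z\|_\infty \le \|z\|_2$. Combining this with (24), we have:

$$\Big|\tilde{\beta}_j - \frac{k_j}{K}\Big| \le \|\tilde{\beta} - \kappa\|_2 \le \sqrt{\frac{n X_b^2 B_b^2}{K \lambda_{\min}(\tilde{X}^T\tilde{X})}}.$$

This implies that $\kappa$ itself component-wise satisfies

$$\tilde{\beta}_j - A \le \frac{k_j}{K} \le \tilde{\beta}_j + A \quad \text{where } A := \sqrt{\frac{n X_b^2 B_b^2}{K \lambda_{\min}(\tilde{X}^T\tilde{X})}}.$$

So far we know that for all $l = 1, \ldots, L$, $\sum_{j=1}^p \tilde{c}_{jl} \tilde{\beta}_j + \delta_l \le 1$ with $\delta_l > 0$, and each coordinate $k_j/K$ within $\kappa$ varies from $\tilde{\beta}_j$ by at most an amount $A$. We would like to establish that the linear constraints $\sum_{j=1}^p \tilde{c}_{jl} \frac{k_j}{K} \le 1$ always hold for such a $\kappa$. For each constraint $l$, substituting the extremal values of $k_j/K$ according to the sign of $\tilde{c}_{jl}$, we get the following upper bound:

$$\sum_{j=1}^p \tilde{c}_{jl} \frac{k_j}{K} \le \sum_{\tilde{c}_{jl} > 0} \tilde{c}_{jl} (\tilde{\beta}_j + A) + \sum_{\tilde{c}_{jl} < 0} \tilde{c}_{jl} (\tilde{\beta}_j - A) = \sum_{j=1}^p \tilde{c}_{jl} \tilde{\beta}_j + A \sum_{j=1}^p |\tilde{c}_{jl}|.$$

Since $\sum_{j=1}^p \tilde{c}_{jl} \tilde{\beta}_j \le 1 - \delta_l$, this sum is less than or equal to 1 whenever $A \sum_{j=1}^p |\tilde{c}_{jl}| \le \delta_l$. Thus we would like $A \le \frac{\delta_l}{\sum_{j=1}^p |\tilde{c}_{jl}|}$ for all $l = 1, \ldots, L$. That is,

$$A \le \min_{l=1,\ldots,L} \frac{\delta_l}{\sum_{j=1}^p |\tilde{c}_{jl}|} \iff \sqrt{\frac{n X_b^2 B_b^2}{K \lambda_{\min}(\tilde{X}^T\tilde{X})}} \le \min_{l=1,\ldots,L} \frac{\delta_l}{\sum_{j=1}^p |\tilde{c}_{jl}|} \iff K \ge \frac{n X_b^2 B_b^2}{\lambda_{\min}(\tilde{X}^T\tilde{X}) \left[\min_{l=1,\ldots,L} \frac{\delta_l}{\sum_{j=1}^p |\tilde{c}_{jl}|}\right]^2}. \qquad \blacksquare$$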
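The step that substitutes extremal values according to the sign of each coefficient can be sanity-checked numerically. The sketch below is hypothetical and uses made-up numbers: it compares the closed-form worst case $\sum_j \tilde{c}_j\tilde{\beta}_j + A\sum_j|\tilde{c}_j|$ against a brute-force maximum over the corners of the box $|\kappa_j - \tilde{\beta}_j| \le A$, where the maximum of a linear function over a box is attained.

```python
from itertools import product


def worst_case_lhs(c_row, beta, A):
    """Closed-form max of sum_j c_j * kappa_j over the box |kappa_j - beta_j| <= A."""
    return sum(c * b for c, b in zip(c_row, beta)) + A * sum(abs(c) for c in c_row)


def brute_force_lhs(c_row, beta, A):
    """Same maximum, found by checking every corner of the box."""
    corners = product(*[(b - A, b + A) for b in beta])
    return max(sum(c * k for c, k in zip(c_row, corner)) for corner in corners)


c_row = (2.0, -1.0)     # one hypothetical scaled constraint
beta = (0.25, 0.25)     # satisfies 2*0.25 - 0.25 + delta <= 1 with delta = 0.5
delta = 0.5
A = 0.125               # chosen so that A * sum|c| = 0.375 <= delta

closed = worst_case_lhs(c_row, beta, A)
brute = brute_force_lhs(c_row, beta, A)
```

Because $A\sum_j|\tilde{c}_j| \le \delta$ here, the worst-case left-hand side stays below 1, matching the conclusion of the lemma for this instance.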
We now proceed with the proof of our main result of this section. The result involves covering numbers, where the cover for the set will be the vectors with discretized coefficients that we have been working with in the lemmas above.

Proof (of Theorem 6)
Recall that

• the matrix $X$ is defined as $[h_1 \ldots h_p]$;

• the scaled versions of vectors $\{h_j\}_j$ are $\tilde{h}_j = \frac{n^{1/r} X_b B_b}{\|h_j\|_r} h_j$ for $j = 1, \ldots, p$;

• the scaled versions of coefficients $\{\beta_j\}_j$ are $\tilde{\beta}_j = \frac{\|h_j\|_r}{n^{1/r} X_b B_b} \beta_j$ for $j = 1, \ldots, p$; and

• any vector $y = X\beta = \sum_{j=1}^p \beta_j h_j$ can be rewritten as $\sum_{j=1}^p \tilde{\beta}_j \tilde{h}_j$.

We will prove three technical facts leading up to the result.

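Fact 1. If $\|\beta\|_q \le B_b$ then $\|\tilde{\beta}\|_1 \le 1$.

Because $1/r + 1/q = 1$, by Hölder's inequality we have:

$$\sum_{j=1}^p |\tilde{\beta}_j| = \frac{1}{n^{1/r} X_b B_b} \sum_{j=1}^p \|h_j\|_r |\beta_j| \le \frac{1}{n^{1/r} X_b B_b} \Big(\sum_{j=1}^p \|h_j\|_r^r\Big)^{1/r} \Big(\sum_{j=1}^p |\beta_j|^q\Big)^{1/q}. \qquad (25)$$

To bound the above, notice that in our notation, $(h_j)_i = (x_i)_j$. That is, the $i$-th component of feature vector $h_j$, i.e., $(h_j)_i$, is also the $j$-th component of example $x_i$. Thus,

$$\Big(\sum_{j=1}^p \|h_j\|_r^r\Big)^{1/r} = \Big(\sum_{j=1}^p \sum_{i=1}^n |(h_j)_i|^r\Big)^{1/r} = \Big(\sum_{i=1}^n \sum_{j=1}^p |(x_i)_j|^r\Big)^{1/r} = \Big(\sum_{i=1}^n \|x_i\|_r^r\Big)^{1/r} \le (n X_b^r)^{1/r} = n^{1/r} X_b.$$

Plugging this into (25), and using the fact that $\|\beta\|_q \le B_b$, we have

$$\sum_{j=1}^p |\tilde{\beta}_j| \le \frac{1}{n^{1/r} X_b B_b}\, n^{1/r} X_b B_b = 1,$$

that is, $\|\tilde{\beta}\|_1 \le 1$.

Fact 2. Corresponding to the set of linear constraints on $\beta$:

$$\sum_{j=1}^p c_{jl} \beta_j + \delta_l \le 1, \; \delta_l > 0, \; l = 1, \ldots, L,$$

there is a set of linear constraints on $\tilde{\beta}_j$, namely $\sum_{j=1}^p \tilde{c}_{jl} \tilde{\beta}_j + \delta_l \le 1$, $l = 1, \ldots, L$.

Recall that $\beta \in \mathcal{B}$ also means that $\sum_{j=1}^p c_{jl} \beta_j + \delta_l \le 1$ for some $\delta_l > 0$ for all $l = 1, \ldots, L$. Thus, for all $l = 1, \ldots, L$,

$$\sum_{j=1}^p c_{jl} \beta_j + \delta_l \le 1 \iff \sum_{j=1}^p c_{jl}\, \frac{n^{1/r} X_b B_b}{\|h_j\|_r} \cdot \frac{\|h_j\|_r}{n^{1/r} X_b B_b}\, \beta_j + \delta_l \le 1 \iff \sum_{j=1}^p \tilde{c}_{jl} \tilde{\beta}_j + \delta_l \le 1,$$

which is the set of corresponding linear constraints on $\{\tilde{\beta}_j\}_j$ we want.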
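Fact 3. $\forall j = 1, \ldots, p$, $\|\tilde{h}_j\|_2 \le n^{1/2} X_b B_b$.

Jensen's inequality implies that for any vector $z$ in $\mathbb{R}^n$, and for any $r \ge 2$, it is true that

$$\frac{1}{n^{1/2}} \|z\|_2 \le \frac{1}{n^{1/r}} \|z\|_r.$$

Using this for our particular vector $\tilde{h}_j$ and our given $r$, we get

$$\frac{1}{n^{1/2}} \|\tilde{h}_j\|_2 \le \frac{1}{n^{1/r}} \|\tilde{h}_j\|_r.$$

But we know

$$\frac{1}{n^{1/r}} \|\tilde{h}_j\|_r = \frac{1}{n^{1/r}} \Big\| \frac{n^{1/r} X_b B_b}{\|h_j\|_r}\, h_j \Big\|_r = X_b B_b.$$

Thus, we have $\|\tilde{h}_j\|_2 \le n^{1/2} X_b B_b$ for each $j$, and thus, $\max_{j=1,\ldots,p} \|\tilde{h}_j\|_2 \le n^{1/2} X_b B_b$.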
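Facts 1 and 3 can be checked on a small data matrix. The sketch below is illustrative only: the helper names, the choice $r = 4$ (so $q = 4/3$), and the three-example data set are hypothetical. It applies the scalings $\tilde{h}_j = \frac{n^{1/r}X_bB_b}{\|h_j\|_r}h_j$ and $\tilde{\beta}_j = \frac{\|h_j\|_r}{n^{1/r}X_bB_b}\beta_j$ and leaves the two inequalities to be inspected.

```python
def lr_norm(v, r):
    """The l_r norm of a sequence of numbers."""
    return sum(abs(t) ** r for t in v) ** (1.0 / r)


def scale(X, beta, r, X_b, B_b):
    """Return (scaled columns h_tilde, scaled coefficients beta_tilde).

    X is an n x p data matrix given as a list of rows (one example per row).
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]   # h_1, ..., h_p
    s = n ** (1.0 / r) * X_b * B_b
    h_tilde = [[s / lr_norm(h, r) * t for t in h] for h in cols]
    beta_tilde = [lr_norm(h, r) / s * b for h, b in zip(cols, beta)]
    return h_tilde, beta_tilde


# Hypothetical data: each row has l_4 norm at most X_b = 1, and beta has l_{4/3} norm at most B_b = 1.
X = [(0.6, 0.8), (1.0, 0.0), (0.0, -1.0)]
beta = (0.5, -0.5)
h_tilde, beta_tilde = scale(X, beta, r=4, X_b=1.0, B_b=1.0)

l1_of_beta_tilde = sum(abs(b) for b in beta_tilde)          # Fact 1: at most 1
l2_of_columns = [lr_norm(h, 2) for h in h_tilde]            # Fact 3: at most sqrt(n) * X_b * B_b
```

For $r = 2$ the bound in Fact 3 holds with equality, so a value of $r$ larger than 2 is used here to show slack.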

With those three facts established, we can proceed with the proof of Theorem 6. Facts 1 and 2 show that the requirements on $\tilde{\beta}$ for Lemma 13 and Lemma 14 are satisfied. Fact 3 shows that the requirement on $\{\tilde{h}_j\}_j$ for Lemma 13 is satisfied with constant $b$ being set to $n^{1/2} X_b B_b$. Since the requirements on $\{\tilde{h}_j\}_j$ and $\{\tilde{\beta}_j\}_j$ are satisfied, we want to choose the right value of positive integer $K$ such that Lemma 14 is satisfied, and we would also like the squared distance between $y$ and $y_K$ to be less than $n\epsilon^2$. To do this, we pick $K$ to be the bigger of the two quantities: $X_b^2 B_b^2/\epsilon^2$ and that given in Lemma 14. That is,

$$K \ge \max\left\{ \frac{X_b^2 B_b^2}{\epsilon^2}, \; \frac{n X_b^2 B_b^2}{\lambda_{\min}(\tilde{X}^T\tilde{X}) \left[\min_{l=1,\ldots,L} \frac{\delta_l}{\sum_{j=1}^p |\tilde{c}_{jl}|}\right]^2} \right\}. \qquad (26)$$

This will force our discretization for the cover to be sufficiently fine that things will work out: we will be able to count the number of cover points in our finite set, and that will be our covering number.

To summarize, with this choice, for any $y \in \mathcal{F}_{|S}$, we can find integers $k_1, \ldots, k_p$ such that the following hold simultaneously:

a. (It gives a valid discretization of $y$.) $\sum_{j=1}^p |k_j| \le K$,

b. (It gives a good approximation to $y$.) The approximation $y_K = \sum_{j=1}^p \frac{k_j}{K} \tilde{h}_j$ is close to $y = \sum_{j=1}^p \tilde{\beta}_j \tilde{h}_j$. That is,

$$\|y - y_K\|_2^2 \le \frac{n X_b^2 B_b^2}{K} \le n\epsilon^2, \text{ and}$$

c. (It obeys operational cost constraints.) $\sum_{j=1}^p \tilde{c}_{jl} \frac{k_j}{K} \le 1$, $l = 1, \ldots, L$.

In the above, the existence of $k_1, \ldots, k_p$ satisfying (a) and (b) comes from Lemma 13, where we have also used $K$ satisfying $K \ge X_b^2 B_b^2/\epsilon^2 \ge 1$. Lemma 14 along with the choice of $K$ from (26) guarantees that (c) holds as well for this choice of $k_1, \ldots, k_p$.

Thus, by (b), any $y \in \mathcal{F}_{|S}$ is within $\epsilon\sqrt{n}$ in $\ell_2$ distance of at least one of the vectors with coefficients $k_1/K, \ldots, k_p/K$. Therefore counting the number of $p$-tuple integers $k_1, \ldots, k_p$ such that (a) and (c) hold, or equivalently the number of solutions to (16), gives a bound on the covering number, which is $|P^K|$. $\blacksquare$



Since Theorem 6 suggests that $|P^K|$ may be an important quantity for the learning process, we discuss how to compute it. We assume that $\tilde{c}_{jl}$ are rationals for all $j = 1, \ldots, p$, $l = 1, \ldots, L$, so that we can multiply each of the $L$ constraints describing $P^K$ by the corresponding least common multiple of the $p$ denominators. (This is without loss of generality because the rationals are dense in the reals.) This ensures that all the constraints describing polyhedron $P^K$ have integer coefficients. Once this is achieved, we can run Barvinok's algorithm (using, for example, Lattice Point Enumeration; see De Loera, 2005 and references therein), which counts integer points inside polyhedra and runs in polynomial time for fixed dimension (which is $p$ here). Using the output of this algorithm within our generalization bound will yield a much tighter bound than in previous works (for example, the bound in Zhang, 2002, Theorem 3), especially when $(r, q) = (\infty, 1)$; this is true simply because we are counting more carefully. Note that counting integer points in polyhedrons is a fundamental question in a variety of fields including number theory, discrete optimization, and combinatorics, to name a few, and making an explicit connection to bounds on the covering number for linear function classes can potentially open doors for better sample complexity bounds.
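The preprocessing step that turns rational constraint coefficients into integer ones can be sketched with exact rational arithmetic. The helper below is hypothetical (it is not part of any lattice-point counting package): it scales one inequality $\sum_j \tilde{c}_j k_j \le K$ by the least common multiple of the denominators, which leaves the set of integer solutions unchanged because the multiplier is positive.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def integerize(row, rhs):
    """Scale sum_j row[j]*k_j <= rhs (rational data) to integer coefficients.

    Entries may be ints, Fractions, or strings such as "2/3".
    Returns (integer coefficients, integer right-hand side).
    """
    fracs = [Fraction(c) for c in row] + [Fraction(rhs)]
    # Least common multiple of all denominators, built up pairwise from gcd.
    m = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in fracs), 1)
    scaled = [int(f * m) for f in fracs]
    return scaled[:-1], scaled[-1]


coeffs, bound = integerize(["1/2", "2/3"], 3)   # (1/2) k_1 + (2/3) k_2 <= 3
```

The resulting integer inequality describes the same lattice points and could then be passed to an integer-point counting routine.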

6. Discussion and Conclusion 

The perspective taken in this work contrasts with traditional decision analysis and predictive modeling; in these fields, a single decision is often the only end goal. Our goal involves exploring how predictive modeling influences decisions and their costs. Unlike traditional predictive modeling, our regularization terms involve optimization problems, and are not the usual vector norms.

The simultaneous process serves as a way to understand uncertainty in decision-making, and can be directly applied to real problems. We centered our discussion and demonstrations around three questions, namely: "What is a reasonable amount to allocate for this task so we can react best to whatever nature brings?", "Can we produce a reasonable probabilistic model, supported by data, where we might expect to pay a specific amount?", and "Can our intuition about how much it will cost to solve a problem help us produce a better probabilistic model?" These are questions that are not handled in a natural way by current paradigms. Answering these three questions is not the only use for the simultaneous process. For instance, domain experts could use the simultaneous process to explore the space of probabilistic models and policies, and then simply pick the policy among these that most agrees with their intuition. Or, they could use the method to refine the probabilistic model, in order to exclude solutions that the simultaneous process found that did not agree with their intuition.

The simultaneous process is useful in cases where there are many potentially good probabilistic models, yielding a large number of (optimal-response) policies. This happens when the training data are scarce, or the dimensionality of the problem is large compared to the sample size, and the operational cost is not smooth. These conditions are not difficult to satisfy, and do occur commonly. For instance, data can be scarce (relative to the number of features) when they are expensive to collect, or when each instance represents a real-world entity where few exist; for instance, each example might be a product, customer, purchase record, or historic event. Operational cost calculations commonly involve discrete optimization; there can be many scheduling, knapsack, routing, constraint-satisfaction, facility location, and matching problems, well beyond what we considered in our simple examples. The simultaneous process can be used in cases where the optimization problem is difficult enough that sampling the posterior of Bayesian models, and computing the policy at each round, is not feasible.

We end the paper by discussing the applicability of our policy-oriented estimation strategy in the real world. Prediction is the end goal for machine learning problems in vision, image processing and biology, and in other scientific domains, but there are many domains where the learning algorithm is used to make recommendations for a subsequent task. We showed applications in Section 3, but it is not hard to find applications in other domains, where using either the traditional sequential process, decision theory, or robust optimization may not suffice. Here are some other potential domains:



• Internet advertising, where the goal of the advertising platform is to choose which ad to show a customer. For each customer and advertiser, there is an uncertain estimate of the probability that the customer will click the ad from that advertiser. These estimates determine which ad will be shown next, which is a discrete decision (Muthukrishnan et al., 2007).

• Portfolio management, where we allocate our budget among n risky assets with uncertain returns, and each asset has a different cost associated with the investment (Konno and Yamazaki, 1991).

• Maintenance applications (in addition to the ML&TRP, Tulabandhula et al., 2011b), where we estimate probabilities of failure for each piece of equipment, and create a policy for repairing, inspecting, or replacing the equipment. Certain repairs are more expensive than others, so the costs of various policy decisions could potentially change steeply as the probability model changes.

• Traffic flows on transportation networks, where the problem can be that of load balancing based on resource constraints and forecasted demands (Bertsimas and Sim, 2003).

• Policy decisions based on dynamical system simulations, for instance, climate policy, where a politician wants to understand the uncertainty in policy decisions based on the results of a large-scale simulation. If the simulation cannot be computed for all initial values, its result can be estimated using a machine learning algorithm (Barton et al., 2010).

• Pharmaceutical companies choosing a subset of possible drug targets to test, where the drugs are predicted to be effective, and not overly expensive to produce (Agarwal et al., 2010). This might be similar in many ways to the real-estate purchasing problem discussed in Section 3.

• Machine task scheduling on multi-core processors, where we need to allocate processors to various jobs during a large computation. This could be very similar to the problem of scheduling with constraints addressed in Section 3. If we optimistically estimate the amount of time each job takes, we ensure that we will free up processors on time so they can be ready for the next part of the computation.

We believe the simultaneous process will open the door for other methods dealing with the interaction of machine learning and decision-making that fall outside the realm of the usual paradigms.